
    What Hundreds of Incidents Taught Me About Response

    Dylan
    6 min read

    Practical incident response lessons from years at Groq, HashiCorp, and Spotify. What actually works when systems fail.

    SRE
    Incident Management

    After managing incidents at Groq, HashiCorp, and Spotify, I have found that certain patterns start to repeat. The systems differ, the symptoms change, and the stakes vary, but the dynamics often rhyme. If I were briefing my day-one self, I would spend less time worrying about exotic failure modes and more time building fluency in the situations that seem to show up every week.

    In my experience, most incidents resolve in a fairly familiar way, and the rare, cinematic outages are not where teams spend most of their time. I have seen the bulk of incidents fall into three buckets: a configuration change that broke something adjacent, a demand change where traffic or usage patterns shifted unexpectedly, or a dependency change where an upstream service degraded, changed behavior, or went down. The cascading, multi-system failures that make great conference stories do happen, but they have felt like the exception rather than the rule. Because of that, I have found it useful to shape playbooks, tooling, and muscle memory around those three categories first.

    I have also learned, sometimes the hard way, that the first five minutes set the tone, and speed without clarity can create more damage than the original issue. I have pushed a "resolved" update the moment metrics looked better and then realized the fix was not stable. The painful part is rarely the wobble itself; it is the whiplash. Moving from "resolved" back to "investigating" tends to burn trust quickly because it signals that you do not actually have a handle on what is happening. What has worked better for me is waiting to confirm stability for ten to fifteen minutes before declaring resolution, and watching the system through at least one full cycle of whatever matters most for that service, whether that is batch processing, cache warmup, or regional traffic rebalance. The pressure to close quickly is real, but in my experience, reopening is worse.
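The "confirm stability before declaring resolution" habit above can be sketched as a hold-down timer: the incident only counts as resolved after health checks have passed continuously for the full confirmation window, and any wobble restarts the clock. Everything here is illustrative; `check_healthy` stands in for whatever metric check matters for your service.

```python
import time

def confirm_stable(check_healthy, hold_seconds=900, poll_seconds=30,
                   max_seconds=3600, clock=time.monotonic, sleep=time.sleep):
    """Return True once check_healthy() has passed continuously for hold_seconds.

    A single failed check resets the window, so a brief wobble forces the
    full confirmation period to start over. Gives up (returns False) if the
    system never stays healthy for the window within max_seconds.
    check_healthy is a hypothetical callable returning True when the
    service's key signals look good.
    """
    start = clock()
    window_start = start  # start of the current healthy streak
    while clock() - start < max_seconds:
        if check_healthy():
            if clock() - window_start >= hold_seconds:
                return True  # healthy for the full hold-down window
        else:
            window_start = clock()  # wobble: restart the confirmation window
        sleep(poll_seconds)
    return False  # never stayed healthy long enough to call it resolved
```

Injecting `clock` and `sleep` keeps the sketch testable; in production you would likely point `check_healthy` at the same dashboard query you use to define customer impact.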

    Over time, I have come to think of communication as part of mitigation, not a side task. In my experience, customers tolerate downtime better than silence, and a consistent cadence changes how an incident feels even when the underlying facts have not changed. For high-severity incidents, a cadence that has worked well for me is external updates every thirty minutes and internal stakeholder updates every fifteen. For lower severities, I have found it can be fine to stretch to sixty minutes or longer, as long as the expectation is set and then met consistently. Missing your own cadence often seems worse than choosing a longer interval because it signals drift and disorganization. The content itself does not need to be revelatory. "We have identified the affected component and are testing a fix" is usually enough, while long silence tends to make people assume you forgot or lost control.

    Recommended update cadence for a critical outage affecting all users (maximum urgency): external updates every 15 minutes, internal updates every 10 minutes.
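Treating cadence as a tracked commitment rather than a good intention can be as simple as a lookup table plus a "are we late?" check. The severity names and intervals below are illustrative, mirroring the numbers discussed above.

```python
from datetime import datetime, timedelta

# Illustrative cadence table (minutes); adjust per your severity scheme.
CADENCE_MINUTES = {
    "high": {"external": 30, "internal": 15},
    "low": {"external": 60, "internal": 60},
}

def next_update_due(severity, audience, last_update):
    """Return the time by which the next update is owed for this audience."""
    interval = CADENCE_MINUTES[severity][audience]
    return last_update + timedelta(minutes=interval)

def cadence_missed(severity, audience, last_update, now):
    """True if we have blown our own stated update cadence."""
    return now > next_update_due(severity, audience, last_update)
```

Wiring `cadence_missed` into the incident tooling as a reminder is one way to make sure the expectation you set is the expectation you meet.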

    I have also become more skeptical of the default "war room" shape. Sometimes it helps, but in my experience it can easily turn into ten people watching one person type, which does not parallelize work and can slow decisions. A structure I have preferred is an Incident Commander who focuses on coordination rather than investigation, an AI scribe or automated tooling to capture the timeline, and a small set of subject-matter experts who can actually make changes safely. Leadership involvement is a nuance I have had to learn to manage. Executives often want to help, but their presence can change the room, and the people debugging usually need psychological safety to say "I do not know," test hypotheses, and back out of dead ends. When leaders need updates, I have found it works better to use a side channel. When their presence becomes disruptive, a separate leadership bridge has been a practical way to protect focus without cutting anyone out.

    Runbooks have helped me many times, but I do not think they stay useful automatically. In my experience, runbooks written after an incident tend to solve that incident, and then they start aging as systems evolve. What has worked better for me is keeping troubleshooting context as close to alert definitions as possible, so that when the alert fires the relevant guidance is right there rather than buried in a wiki. The other part is a feedback loop. When an alert pages on-call, I have found it valuable to treat validation of the linked guidance as part of resolution. If the steps were wrong, incomplete, or misleading, updating them immediately tends to keep documentation honest because it is refreshed while the details are still clear.

    During incidents, I have seen teams reach for MTTR (mean time to resolution) as if it can steer the moment, but I have come to think of it as primarily a lagging indicator. In the middle of an incident, the metric that has felt most useful is customer impact, specifically who is impacted and to what degree. "The API is returning 500s" describes a symptom, but it does not tell you whether this affects one percent of users or everyone, whether it is isolated to a region, or whether it blocks payments versus slowing a dashboard. In my experience, those distinctions shape severity, messaging, and prioritization, and they prevent teams from optimizing the wrong thing under pressure. I have found it pays to build dashboards that answer "who is hurting right now?" before you need them, because during the incident is a terrible time to invent the query that defines impact.
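A "who is hurting right now?" view can be as simple as grouping recent errors by user and region. The record shape below (`user`, `region`, `status` fields) is an assumption for illustration; in practice this would be a prebuilt dashboard query over your request logs.

```python
from collections import Counter

def impact_summary(requests):
    """Summarize customer impact from recent request records.

    Each record is assumed to be a dict with 'user', 'region', and
    'status' keys. Returns the share of distinct users seeing server
    errors plus a per-region error count, which is the information
    that actually shapes severity, messaging, and prioritization.
    """
    all_users = {r["user"] for r in requests}
    hurt_users = {r["user"] for r in requests if r["status"] >= 500}
    errors_by_region = Counter(
        r["region"] for r in requests if r["status"] >= 500
    )
    share = len(hurt_users) / len(all_users) if all_users else 0.0
    return {
        "affected_user_share": share,
        "errors_by_region": dict(errors_by_region),
    }
```

The point is not this particular query but having it, or its dashboard equivalent, ready before the incident starts.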

    Finally, I do not think every incident deserves the same retrospective weight. It is tempting to mandate a full retrospective for everything in the name of learning, but in my experience that can quickly turn into fatigue, and fatigue produces shallow writeups that nobody trusts. The tenth retrospective about a configuration typo rarely teaches something new. What has worked better for me is asking the responders. They were there, they usually know whether the failure was novel or familiar, and they can tell you whether the right output is a full, blameless retrospective with stakeholders and action items, a lightweight writeup, or a quick note plus a one-line fix. Sizing the retrospective effort intentionally has helped preserve energy for the incidents that actually change how you understand the system.
