
    The SLO Math Most Teams Get Wrong

    Dylan
    7 min read

    More nines sound possible until you do the pager math. Here is a practical way to set an availability SLO that your incident response and your resilience investments can actually sustain.

    SRE

    "Why are we below our availability target again?"

    I have heard this line in reliability reviews at Spotify, HashiCorp, and, more recently, at Groq. Someone drops 99.99% onto a slide because it signals confidence, and then reality shows up: the on-call engineer takes 15 minutes to acknowledge the page, another 20 to figure out which subsystem is failing, and another 30 to restore service. That is over an hour of user impact from a single incident, even when the team is doing solid work.

    Now compare that with the error budget. At 99.99% availability, the monthly budget is measured in minutes: about 4.03 minutes in a 28-day window, 4.32 minutes in a 30-day window, and 4.46 minutes in a 31-day window. One real incident can wipe out the month, and doing that math half-awake at 3 a.m. never gets friendlier. If you want to skip ahead, try the SLO Calculator and plug in your numbers.

    Most 99.99% SLOs fail for the same reason: teams treat the SLO as a statement of intent rather than a statement of capability. The pattern is predictable. A team picks the most optimistic number in the room, converts it into an error budget, assumes they can operate inside that budget, misses the target anyway, and then slowly stops trusting the framework. When misses become routine, people start debating whether downtime "should count," and the SLO stops being a decision tool.

    Resilience is capability. It is the system's ability to absorb disruption, limit user impact, restore service quickly, and learn so the same failure mode becomes less likely over time. An SLO is only credible if it reflects that reality. If it does not, it becomes a monthly reminder that the number was never connected to how the service behaves under stress.

    For time-based availability SLOs, the ceiling usually comes down to incident frequency, time to restore service, and blast radius. Many teams model only frequency and restoration time while implicitly assuming every incident affects 100% of users. Sometimes that is true, but when it is not, the assumption hides some of the highest-leverage resilience work you can do.

    Start with a simple model. Let W be the minutes in your measurement window, and for a rolling 30-day window, W is 43,200. If each incident effectively takes the whole service down, your downtime is incidents multiplied by mean time to restore (MTTR), and your achievable SLO is one minus that downtime divided by W. Put numbers on it: if you have two incidents per month and the average MTTR is 45 minutes, you are looking at 90 minutes of downtime, which works out to about 99.79% achievable availability over a 30-day window. That team can credibly target something like 99.7% or 99.75%, and a 99.9% target is a commitment their current incident response cannot deliver.
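    That model fits in a few lines of Python (the function name and defaults here are my own, not from any particular library):

```python
def achievable_slo(incidents_per_window: float, mttr_minutes: float,
                   window_minutes: float = 43_200) -> float:
    """Achievable availability when every incident takes the whole service down."""
    downtime = incidents_per_window * mttr_minutes
    return 1 - downtime / window_minutes

# Two incidents per rolling 30-day window, 45-minute average MTTR:
print(f"{achievable_slo(2, 45):.4%}")  # → 99.7917%
```

    With those inputs the ceiling lands just under 99.8%, which is why a 99.7% or 99.75% target is credible and 99.9% is not.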

    Blast radius changes the picture because not every incident hits every user. A database failover might cause brief errors for a fraction of traffic, and a bad deploy might degrade one region while others stay healthy. In that world, "effective downtime" is incidents multiplied by MTTR multiplied by average blast radius. If you have per-incident impact data, summing duration times impact fraction per incident is even better, especially when blast radius varies widely. For example, three incidents per month with 40-minute average MTTR and a 50% average blast radius produces about 60 minutes of effective downtime, which is roughly 99.86% achievable over 30 days. The incident count is higher, but investments in graceful degradation, isolation, and traffic shaping can still raise the ceiling.
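    When you have per-incident impact data, the per-incident sum is a one-liner. A minimal sketch, using (duration, impact fraction) pairs of my own devising:

```python
def effective_downtime(incidents: list[tuple[float, float]]) -> float:
    """Sum of per-incident (duration_minutes, impact_fraction) pairs."""
    return sum(duration * impact for duration, impact in incidents)

# Three incidents at 40-minute MTTR, each hitting 50% of users:
downtime = effective_downtime([(40, 0.5), (40, 0.5), (40, 0.5)])  # 60 effective minutes
print(f"{1 - downtime / 43_200:.4%}")  # → 99.8611%
```

    The same function handles the common case where one incident is a full outage and the rest are partial, which is exactly when the averaged blast radius misleads.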

    MTTR is not a single knob, and treating it like one is where teams waste time. It is the sum of phases: detection, acknowledgment, diagnosis, and remediation. A well-tuned team might detect in two minutes, acknowledge in five, diagnose in ten, and remediate in fifteen, landing around 32 minutes total. A team with gaps can easily stretch to 70 minutes, with slower alerting, delayed human response, unclear telemetry, and manual recovery steps. The incident count can be identical, but the achievable SLO is not.
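    Treating MTTR as a sum of phases makes that comparison concrete. A quick sketch (the well-tuned phase values come from the text; the 70-minute team is represented only by its total):

```python
def achievable(incidents: float, mttr_minutes: float, window: float = 43_200) -> float:
    """Achievable availability over a rolling 30-day window."""
    return 1 - incidents * mttr_minutes / window

tuned = {"detection": 2, "acknowledgment": 5, "diagnosis": 10, "remediation": 15}
mttr_tuned = sum(tuned.values())  # 32 minutes

# Same incident count, different response capability:
print(f"{achievable(2, mttr_tuned):.3%}")  # → 99.852%
print(f"{achievable(2, 70):.3%}")          # → 99.676%
```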

    What Can You Achieve?

    Example inputs: detection 5 min + acknowledgment 10 min + diagnosis 15 min + remediation 15 min = 45 min MTTR; 2 incidents per month; 100% blast radius.

    Effective downtime = 2 × 45 min × 100% = 90 min
    Achievable SLO = (43,200 − 90) / 43,200 = 99.79%

    The relationship between incident tolerance and availability is linear, which is both the good news and the bad news. Halving MTTR produces the same availability gain as halving incident count, all else equal. That is why many teams see the fastest returns by tightening detection, improving on-call ergonomics, filling observability gaps, and making rollback or mitigation paths reliable. Prevention work, like better testing, capacity planning, dependency hardening, and game days, often pays off more slowly because it requires deeper engineering investment. Blast radius reduction can be the highest leverage of all, but it demands architectural commitment, and you rarely get it for free.
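    The symmetry is easy to check directly, since downtime is just incidents times MTTR (formula restated from the model above):

```python
def achievable(incidents: float, mttr_minutes: float, window: float = 43_200) -> float:
    """Achievable availability over a rolling 30-day window."""
    return 1 - incidents * mttr_minutes / window

baseline = achievable(4, 60)  # 4 incidents/month at 60-minute MTTR

# Halving MTTR and halving incident count produce identical availability:
assert achievable(4, 30) == achievable(2, 60)

print(f"{baseline:.3%} -> {achievable(4, 30):.3%}")  # → 99.444% -> 99.722%
```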

    If you want an honest SLO, the process is straightforward, but it requires discipline. Measure MTTR by phase, and if you lack data, use pessimistic estimates and commit to instrumenting the gaps. Count user-impacting incidents over the past three to six months and normalize to a monthly rate. Estimate blast radius, and if most incidents are full outages, assume 100%. Calculate the achievable SLO from those inputs, or use the calculator. Then set your target slightly below the computed ceiling, because you need slack for worse-than-average months. If the business needs a higher number, name the concrete improvement path in operational terms, such as cutting acknowledgment time from 12 minutes to five, rather than vague mandates like "improve reliability."
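    The whole process can be wrapped in one function. Note that the headroom below the ceiling, 0.05 percentage points here, is my own placeholder rather than a number from the post; size it to your month-to-month variance:

```python
def honest_target(phase_minutes: dict, incidents_per_month: float,
                  blast_radius: float, headroom: float = 0.0005,
                  window: float = 43_200) -> float:
    """SLO target set slightly below the achievable ceiling."""
    mttr = sum(phase_minutes.values())  # MTTR as the sum of response phases
    ceiling = 1 - incidents_per_month * mttr * blast_radius / window
    return ceiling - headroom

phases = {"detection": 5, "acknowledgment": 10, "diagnosis": 15, "remediation": 15}
print(f"{honest_target(phases, 2, 1.0):.4%}")  # → 99.7417% (ceiling 99.79% minus headroom)
```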

    Once the SLO reflects capability, error budgets become useful again because they support real trade-offs. At 99.8% monthly, you have 86.4 minutes of budget, which is enough to have adult conversations about risk. You can ask whether a change with meaningful blast radius risk fits inside the remaining budget, slow down when burn accelerates, and push velocity when you are consistently under budget. Error budgets only work when teams believe the target is achievable. A 99.99% objective that gets missed every month teaches everyone to ignore the framework entirely.
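    Budget arithmetic is simple enough to keep in a helper (the mid-month 50-minute burn below is an illustrative figure, not from the post):

```python
def error_budget_minutes(slo: float, window_minutes: float = 43_200) -> float:
    """Error budget for a time-based SLO over a rolling window."""
    return (1 - slo) * window_minutes

budget = error_budget_minutes(0.998)
print(f"{budget:.1f} min")  # → 86.4 min

# Mid-month check: is there room for a risky change after a 50-minute burn?
remaining = budget - 50
print(f"{remaining:.1f} min remaining")  # → 36.4 min remaining
```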

    The calculator is built to force capability-first thinking. "What can I achieve?" takes response times by phase and incident frequency and returns your honest ceiling. "Can I meet this SLO?" starts from a target and checks whether your current capabilities can support it. "Budget Burndown" lets you simulate incidents and watch the budget deplete over time. Start with the first tab, because it anchors the discussion in operational reality rather than aspiration.

    A few questions come up every time. At 99.9% monthly, your error budget is 43.2 minutes, so the incident tolerance is simply that budget divided by your MTTR. With 30-minute MTTR, you can afford about 1.4 typical incidents, and with 60-minute MTTR, you can afford about 0.7, which means one typical incident exceeds the budget. The distinction between SLO and SLA matters, too: an SLO is an internal target, while an SLA is a contractual commitment with consequences, and the SLO should generally be stricter so you have buffer. Finally, time-based SLOs are easier for operational planning, request-based SLOs are often more precise, and many teams use both depending on what they need to manage.
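    The incident-tolerance arithmetic from that first question, as a sketch:

```python
def incident_tolerance(slo: float, mttr_minutes: float,
                       window_minutes: float = 43_200) -> float:
    """How many typical incidents fit inside the error budget for the window."""
    budget = (1 - slo) * window_minutes
    return budget / mttr_minutes

print(f"{incident_tolerance(0.999, 30):.1f}")  # → 1.4
print(f"{incident_tolerance(0.999, 60):.1f}")  # → 0.7 — one typical incident blows the budget
```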

    Most teams cannot sustain 99.99% availability. The math does not work unless you have sub-minute detection with low false positives, coverage that eliminates long acknowledgment gaps or replaces them with automation, runbooks that nearly diagnose the issue for you, remediation that is automated or close to one-click, and architectures that routinely limit blast radius. That level of resilience takes sustained investment, and for many services, 99.9% is the honest target. Consistently hitting a realistic number builds trust faster than chronically missing an aspirational one. The goal is not the highest number; it is the truest one your resilience investments can defend.
