davidmah.com

SRE 02: The Incident Response Pipeline

This is part 2 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the first essay, please do, because we'll continue to use vocabulary and concepts defined in it.

In the previous essay, I introduced the three reliability dimensions: latency, availability, and durability, and briefly described how to drive down risk on each of them.

This time, we will take a deeper look at how to break these dimensions down. Once you can break them down, you can identify issues and prioritize solutions at the appropriate level of granularity.

To start, let's take a look at the Incident Response Pipeline.

  number of minutes in a month
X percent chance in any particular minute of a relevant system scenario
X percent chance that protections fail to prevent an incident
X (
      number of minutes before detection by verifications
    + number of minutes before remediation is completed
  )
X monetary cost to business given this type of incident.
----------
= monetary cost per month of reliability issues

Your systems are stable and healthy most of the time. If that were perpetually the case, you would have no reliability concerns. There would be no need to discuss Site Reliability. But your systems aren't always reliable. For this reason, our discussion centers around moments when your systems are unstable and unhealthy.

What are those moments of failure? Following the equation: a system scenario occurs, which might trigger a reliability incident, and that incident persists until it is detected and resolved. The type and duration of the incident yield short-term and long-term business costs, which can be conceptualized as a monetary cost. When you sum up those costs over time, you have a tally of how much non-reliability is costing your business.
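To make the equation concrete, here is a minimal Python sketch of it for a single incident type. Every number is made up for illustration, and the parameter names are my own. It also assumes a 30-day month, treats each "percent chance" as a probability between 0 and 1, and reads "monetary cost to business" as a cost per minute of incident, since the equation multiplies it by a number of minutes.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def monthly_cost(
    p_scenario_per_minute: float,
    p_protections_fail: float,
    minutes_to_detect: float,
    minutes_to_remediate: float,
    cost_per_incident_minute: float,
) -> float:
    """Expected monthly cost of one incident type, following the equation."""
    # Expected number of incidents per month: how often the scenario
    # occurs, times how often protections fail to stop it.
    expected_incidents = (
        MINUTES_PER_MONTH * p_scenario_per_minute * p_protections_fail
    )
    # Each incident lasts until it is detected and then remediated.
    incident_minutes = minutes_to_detect + minutes_to_remediate
    return expected_incidents * incident_minutes * cost_per_incident_minute


# Hypothetical inputs, chosen only to show the arithmetic.
cost = monthly_cost(
    p_scenario_per_minute=0.0001,
    p_protections_fail=0.5,
    minutes_to_detect=10,
    minutes_to_remediate=50,
    cost_per_incident_minute=100.0,
)
print(f"${cost:,.2f} per month")
```

With these inputs the scenario occurs about 4.32 times a month, about 2.16 of those become incidents, each lasts 60 minutes, and the total comes to roughly $12,960 per month.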

Now that we've talked through it, we can simplify it using condensed vocabulary:

  Time
X Scenarios potentially yielding Incidents
X Lack of Protections
X (
      Lack of Verifications
    + Lack of Remediations
  )
X Business Cost
----------
= Cost of Non-reliability

To reduce the cost of non-reliability, you can influence any component of the equation. Here are a few examples:

Protections: Disable unit test code when running in production. Whereas someone could previously run unit tests in production by accident and wipe the database, incidents like this are no longer possible.
Remediations: Implement operational tools to help diagnose incidents twice as fast. If diagnosis dominates the duration of an incident, incidents now last about half as long.
Cost: Select customers who don't change their purchasing decisions based on website speed. Latency is now irrelevant.
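The three examples above can be sketched as changes to individual inputs of the equation. This is an illustration under assumptions, not a measurement: the baseline numbers are invented, the parameter names are mine, a month is taken as 43,200 minutes, and the business cost is treated as a per-minute figure.

```python
def monthly_cost(p_scenario, p_fail, detect_min, remediate_min, cost_per_min,
                 minutes_per_month=43_200):
    """Expected monthly cost for one incident type (30-day month assumed)."""
    return (minutes_per_month * p_scenario * p_fail
            * (detect_min + remediate_min) * cost_per_min)


# Hypothetical baseline for a single incident type.
baseline = dict(p_scenario=0.0001, p_fail=0.5, detect_min=10,
                remediate_min=50, cost_per_min=100.0)

# Protections: the incident can no longer happen, so protections never fail.
with_protection = {**baseline, "p_fail": 0.0}

# Remediations: assume better tooling halves the remediation time.
with_faster_remediation = {**baseline,
                           "remediate_min": baseline["remediate_min"] / 2}

# Cost: customers are indifferent, so this incident type costs nothing.
with_indifferent_customers = {**baseline, "cost_per_min": 0.0}

scenarios = [
    ("baseline", baseline),
    ("protection added", with_protection),
    ("remediation twice as fast", with_faster_remediation),
    ("indifferent customers", with_indifferent_customers),
]
for name, params in scenarios:
    print(f"{name:28s} {monthly_cost(**params):>12,.2f}")
```

In this sketch, halving remediation time reduces the cost by less than half, because the detection minutes are unchanged. How much each lever is worth depends on which term dominates in your own numbers.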

Take on only those reliability projects that will significantly improve one or more steps in the pipeline. Tackling a mathematically insignificant component of this pipeline is a misuse of your engineering staff.
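One way to spot a mathematically insignificant component is to compute the equation for each incident type and look at its share of the total. The sketch below does that. The incident types, the field names, and every number are hypothetical. It again assumes a 30-day month and a per-minute business cost.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 43_200  # assumes a 30-day month


@dataclass
class IncidentType:
    name: str
    p_scenario_per_minute: float
    p_protections_fail: float
    minutes_to_detect: float
    minutes_to_remediate: float
    cost_per_incident_minute: float

    def monthly_cost(self) -> float:
        """Expected monthly cost of this incident type."""
        return (
            MINUTES_PER_MONTH
            * self.p_scenario_per_minute
            * self.p_protections_fail
            * (self.minutes_to_detect + self.minutes_to_remediate)
            * self.cost_per_incident_minute
        )


# Made-up incident types, for illustration only.
incident_types = [
    IncidentType("database wipe", 0.000001, 1.0, 5, 240, 1000.0),
    IncidentType("slow page loads", 0.001, 0.2, 30, 30, 2.0),
    IncidentType("failed deploy", 0.0002, 0.5, 10, 20, 50.0),
]

total = sum(t.monthly_cost() for t in incident_types)

# Rank incident types from most to least costly.
ranked = sorted(incident_types, key=lambda t: t.monthly_cost(), reverse=True)
for t in ranked:
    share = t.monthly_cost() / total
    print(f"{t.name:16s} ${t.monthly_cost():>10,.2f}  {share:6.1%}")
```

With these invented numbers, the rare but expensive database wipe dominates the total, while the frequent but cheap latency problem contributes only a small share. A project aimed at the latter would move the overall cost very little.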

That's a good place to stop for this essay. In the next essay, we will discuss where to set the goalposts for reliability. Moving forward without contextualizing reliability in terms of your business is akin to flying blind. Have a great day!


That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.