davidmah.com

SRE 05: Incident Response Building Blocks.

This is part 5 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use the vocabulary and concepts defined in them.

Previously, we discussed SLOs, which are concrete metrics you use to hold yourself accountable to your reliability goalposts.

This time, we drill into how to improve reliability. This essay focuses on definitions, and later essays will work through examples of them.

Let's start zoomed out. Your systems are in an unhealthy state when:

An incident-provoking scenario occurs,
and Protections fail to prevent it from escalating into an incident,
and Verification has not yet brought attention to the incident,
or Remediation has not yet resolved the circumstances maintaining the incident.
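
One way to read the conditions above is as a boolean predicate. The sketch below is only an illustration: IncidentState and is_unhealthy are hypothetical names, and the grouping of the final "or" (the incident persists while it is either undetected or unremediated) is an assumption about the intended precedence.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    """Hypothetical snapshot of where a potential incident stands."""
    scenario_occurred: bool   # an incident-provoking scenario happened
    protection_held: bool     # a Protection stopped it from escalating
    detected: bool            # a Verification brought attention to it
    remediated: bool          # a Remediation resolved what maintained it


def is_unhealthy(state: IncidentState) -> bool:
    # Assumed grouping: scenario AND protections failed
    # AND (not yet detected OR not yet remediated).
    return (
        state.scenario_occurred
        and not state.protection_held
        and (not state.detected or not state.remediated)
    )
```

Under this reading, each building block gets one chance to flip the result: a Protection that holds short-circuits the whole expression, while Verification and Remediation must both complete before the system counts as healthy again.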

Mechanisms that you can introduce to prevent and resolve incidents are your Incident Response Building Blocks.

A Protection

Is a guard rail that halts forward motion.
Or is a defensive algorithm that adjusts how to fulfill requests based on circumstances.
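
To make the two flavors of Protection concrete, here is a minimal sketch. The names guard_rail, GuardRailError, and with_fallback are hypothetical, and falling back to a secondary path is just one possible defensive algorithm, chosen here for brevity.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class GuardRailError(RuntimeError):
    """Raised when a guard rail halts an operation."""


def guard_rail(allowed: bool, reason: str) -> None:
    # Guard rail: stop forward motion entirely when the check fails.
    if not allowed:
        raise GuardRailError(reason)


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    # Defensive algorithm: keep fulfilling the request, but change how,
    # by switching to a fallback when the primary path fails.
    try:
        return primary()
    except Exception:
        return fallback()
```

The difference is in what the caller experiences: the guard rail refuses to continue, while the defensive algorithm still returns an answer, possibly a degraded one.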

A Verification

Is a probe or measurement of the system whose result is used to classify whether an incident is present.
- Is periodic or event-triggered.
- Is ideally but not necessarily fully automated.
In all cases, there is a mechanism to notify a human or bot operator and ask for remediation.
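
The three parts of a Verification (probe, classify, notify) can be sketched as a single function. This is an illustration under assumptions: run_verification and its parameters are hypothetical names, and the probe is assumed to return a single number.

```python
from typing import Callable


def run_verification(
    probe: Callable[[], float],
    is_healthy: Callable[[float], bool],
    notify: Callable[[str], None],
) -> bool:
    """Run one verification pass; return True when no incident is seen."""
    measurement = probe()               # probe or measure the system
    healthy = is_healthy(measurement)   # classify the result
    if not healthy:
        # Ask a human or bot operator for remediation.
        notify(f"Verification failed: measured {measurement}")
    return healthy
```

A scheduler would call this on a timer for a periodic Verification, while an event hook would call it for an event-triggered one. The function itself does not need to know which.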

A Remediation

Is the workflow for resolving the incident, carried out by either a human or a bot.
It starts with a diagnosis of the circumstances maintaining the incident.
This is followed by an adjustment that invalidates the circumstances maintaining the incident.
- Note: this may have no relation to the original incident-provoking scenario!
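
The diagnose-then-adjust shape of a Remediation can be sketched as follows. Everything here is hypothetical: remediate, the string-valued diagnosis, and the playbook dictionary that maps a diagnosed cause to an adjustment are assumptions made for illustration.

```python
from typing import Callable, Dict


def remediate(
    diagnose: Callable[[], str],
    adjustments: Dict[str, Callable[[], None]],
) -> bool:
    """Diagnose what maintains the incident, then apply a matching adjustment.

    Returns False when no adjustment is known, so the caller can escalate.
    """
    cause = diagnose()                # what is maintaining the incident?
    adjust = adjustments.get(cause)   # look up an adjustment for that cause
    if adjust is None:
        return False
    adjust()                          # invalidate the maintaining circumstances
    return True
```

Note that the lookup is keyed on what currently maintains the incident, not on the scenario that provoked it, which matches the note above.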

We will dig into each of those in later posts. To help you tie together the definitions, here is an example incident and how it could be influenced by the incident response building blocks.

Scenario: Developer runs unit tests on production servers.
Incident: The unit test code drops all database tables.
Potential Protection: assert not IS_PRODUCTION at the start of the unit test command.
Potential Verification: A script runs every 60 seconds and asserts 1G < database_size < 8G. If the assertion fails, it sends an SMS notification to a human operator.
Potential Remediation: A human operator diagnoses by logging into the database and checking table sizes. The operator realizes the data is gone, and initiates a database replacement using a historical data snapshot.

The potential protection stops the incident from occurring. The potential verification shortens the delay before the human operator arrives. The potential remediation is how the incident could be resolved.
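
The Potential Protection and Potential Verification from this example are small enough to sketch directly. The function names are hypothetical, and reading "1G" and "8G" as gibibytes is an assumption. How the database size is measured and how the SMS is sent are left out.

```python
GIB = 1024 ** 3  # assumption: "1G" and "8G" mean gibibytes
MIN_DB_BYTES = 1 * GIB
MAX_DB_BYTES = 8 * GIB


def assert_not_production(is_production: bool) -> None:
    # Potential Protection: refuse to start the unit tests in production.
    assert not is_production, "unit tests must not run on production servers"


def database_size_ok(size_bytes: int) -> bool:
    # Potential Verification: 1G < database_size < 8G.
    return MIN_DB_BYTES < size_bytes < MAX_DB_BYTES
```

One caveat with the assert form: Python removes assert statements when run with the -O flag, so a guard that must always hold may be better written as an explicit check that raises an exception.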

That's a good place to stop for this essay. Over the next few essays, we will discuss each incident response building block with nuances and examples. Have a great day!


That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.