Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.
Previously, we discussed SLOs, which are concrete metrics you use to hold yourself accountable to your reliability goal posts.
This time, we drill into how to improve reliability. This essay focuses on definitions, and later essays will work through examples of those definitions.
Let's start zoomed out. Your systems enter an unhealthy state when:
Mechanisms that you can introduce to prevent and resolve incidents are your Incident Response Building Blocks.
A Protection
A Verification
A Remediation
We will dig into each of those in later posts. To help you tie together the definitions, here is an example incident and how it could be influenced by the incident response building blocks.
Potential protection: assert not IS_PRODUCTION at the start of the unit test command.
Potential verification: assert 1G < database_size < 8G, else sends an SMS notification to a human operator.
The potential protection stops the incident from occurring. The potential verification reduces the delay for the human operator to arrive. The potential remediation is how the incident could be resolved.
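To make the two assertions in the example concrete, here is a minimal Python sketch of what they might look like. Everything named here is an assumption for illustration: the APP_ENV environment variable standing in for IS_PRODUCTION, the reading of "1G" and "8G" as GiB, and the notify callable standing in for whatever actually sends the SMS.

```python
import os

GIB = 1024 ** 3  # assumption: "1G" and "8G" in the example mean GiB


def is_production(env=None):
    """Hypothetical stand-in for IS_PRODUCTION, based on an APP_ENV variable."""
    env = os.environ if env is None else env
    return env.get("APP_ENV", "") == "production"


def guard_unit_tests(env=None):
    """Protection: refuse to start the unit test command in production."""
    if is_production(env):
        raise RuntimeError("refusing to run unit tests against production")


def verify_database_size(database_size, notify, low=1 * GIB, high=8 * GIB):
    """Verification: tell a human when the size leaves the expected range.

    `notify` is any callable that takes a message string, for example a
    wrapper around an SMS provider. Returns True when the size is healthy.
    """
    healthy = low < database_size < high
    if not healthy:
        notify(f"database_size={database_size} bytes is outside ({low}, {high})")
    return healthy
```

The guard raises an exception rather than using a bare assert statement, because Python skips assert statements when run with the -O flag. Passing notify in as a parameter keeps the check itself testable without sending real messages.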
That's a good place to stop for this essay. Over the next few essays, we will discuss each incident response building block with nuances and examples. Have a great day!
If you have a question or an interesting thought to bounce around, email me at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.