Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use the vocabulary and concepts defined in them.
Previously, we discussed setting alerts on leading indicators, which is a way to use alerts as a Protection as opposed to a Verification.
There's another re-purposing we can do with those alerts, which is to aid an operator in the diagnosis phase of a Remediation workflow.
Let's discuss Remediation, specifically the diagnosis half. We'll plug in the alert use-case after that.
Remediation begins when an operator first becomes aware of the presence of an urgent incident. The operator then diagnoses the circumstances maintaining the incident and makes an adjustment to those circumstances to resolve it.
I'd like to highlight my intention behind the vocabulary, "circumstances maintaining the incident". You might have otherwise expected me to say "root cause", but that is not what I'm talking about.
Although an incident may have been triggered by one chain of events, there are often several factors that keep it going. Ideally, the operator identifies one of those factors to adjust in order to resolve the incident. This can often be done without understanding the original chain of events.
For instance, perhaps a new algorithm is non-performant, pushing server CPU utilization past a tipping point, and the site goes down. As an adjustment, the operator can provision more servers to distribute the load. This resuscitates the site, and the operator never had to understand the source of the increased CPU utilization.
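To make that tipping point concrete, here is a toy sketch in Python with made-up numbers (the request rate, per-request CPU cost, fleet size, and threshold are all hypothetical, not taken from a real incident). The point it illustrates is that adding servers brings utilization back under the threshold without the operator ever learning why the per-request cost rose.

    # Toy model: average fleet CPU utilization for a given server count.
    # All numbers are hypothetical.
    REQUESTS_PER_SECOND = 1200        # current traffic
    CPU_SECONDS_PER_REQUEST = 0.05    # cost after the non-performant algorithm shipped
    CORES_PER_SERVER = 16
    TIPPING_POINT = 0.80              # above this, the site falls over

    def utilization(num_servers: int) -> float:
        """Average CPU utilization across the fleet."""
        cpu_needed = REQUESTS_PER_SECOND * CPU_SECONDS_PER_REQUEST  # 60 cores' worth of work
        cpu_available = num_servers * CORES_PER_SERVER
        return cpu_needed / cpu_available

    # With 4 servers: 60/64 = 0.94 -> past the tipping point, the site is down.
    # With 6 servers: 60/96 = 0.62 -> two more servers resuscitate the site.
    for n in (4, 6):
        util = utilization(n)
        print(n, round(util, 2), "DOWN" if util > TIPPING_POINT else "ok")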
Although the operator doesn't need to understand the original chain of events, they do have to piece together why the incident persists. To do this, when an operator begins diagnosis, they should start by inspecting the symptom. They should incrementally inspect underlying causes until they produce enough of an understanding to make the right adjustment.
Here is another example.
Symptom: Website is not responding.
Diagnostic actions:
1. The website is not responding, so check the load balancer: it reports every application server as unhealthy.
2. Log in to one of the application servers: its logs show requests stuck waiting on the database.
3. Inspect the database: it has run out of available connections.
With that understanding, the operator can make an adjustment, such as raising the connection limit or restarting the clients hogging connections, and the site comes back.
The characteristic to take note of is how one breadcrumb leads to the next, in an 'A to B to C' sort of way. Staying on this trail makes the forward progress of the diagnosis easy to see.
An alternative to incremental diagnosis is pattern matching: scan a wide variety of diagnostic information in the hope that an answer jumps out. This is valid, and it can be a fast way to resolve simple incidents. But if the pattern matching doesn't succeed quickly, continuing to diagnose this way will feel aimless and endless. A good balance is to start with pattern matching and, if it doesn't quickly yield answers, switch to incremental diagnosis.
Finally, let's touch back on that use-case of alerts.
In previous essays, we discussed how you should avoid urgently alerting on underlying causes because they don't necessarily represent real production problems. However, you can use non-urgent alerts on underlying causes to aid in diagnosis. Put them on a dashboard instead of sending an SMS to your team. When an operator is debugging a symptom, they can check this dashboard to make their investigation simpler.
The following are the steps for using non-urgent alerts this way:
1. Define alerts on underlying causes, just as you might have been tempted to for urgent alerting.
2. Route them to a dashboard rather than to your team's SMS or pager.
3. When diagnosing a symptom, check the dashboard to see which underlying causes are currently firing, and treat those as breadcrumbs.
Instead of logging into machines and reading graphs, an operator can follow the trail of underlying cause alerts. Incidents that would have taken hours can be driven down to minutes if the right alerts are present.
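As a sketch of how this routing can be wired up (the alert names, router, and channels below are hypothetical, not a particular monitoring product's API): symptom alerts page a human, while underlying-cause alerts are only recorded on a dashboard for the operator to consult during diagnosis.

    # Hypothetical sketch: route urgent symptom alerts to the pager,
    # and non-urgent underlying-cause alerts to a dashboard.
    from dataclasses import dataclass

    @dataclass
    class Alert:
        name: str
        urgent: bool  # True for symptoms, False for underlying causes

    def page_oncall(alert: Alert) -> None:
        print(f"PAGE (SMS): {alert.name}")       # demands immediate attention

    def record_on_dashboard(alert: Alert) -> None:
        print(f"dashboard entry: {alert.name}")  # visible when an operator goes looking

    def route(alert: Alert) -> None:
        if alert.urgent:
            page_oncall(alert)
        else:
            record_on_dashboard(alert)

    for alert in [
        Alert("website_not_responding", urgent=True),        # symptom
        Alert("server_cpu_above_90_percent", urgent=False),  # underlying cause
        Alert("db_connections_near_limit", urgent=False),    # underlying cause
    ]:
        route(alert)

When the operator starts from the symptom, each firing dashboard entry is a breadcrumb they get for free.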
That's a good place to stop for this essay. In the next essay, we'll discuss how the adjustments we make to remediate incidents can themselves be risky and cause more incidents! Have a great day!
That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.