Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use the vocabulary and concepts defined in them.
Previously, we discussed setting alerts on leading indicators, which is a way to use alerts as a Protection as opposed to a Verification.
There's another re-purposing we can do with those alerts, which is to aid an operator in the diagnosis phase of a Remediation workflow.
Let's discuss Remediation, specifically the diagnosis half. We'll plug in the alert use-case after that.
Remediation begins when an operator first becomes aware of the presence of an urgent incident. The operator then diagnoses the circumstances maintaining the incident and makes an adjustment to those circumstances to resolve it.
I'd like to highlight my intention behind the vocabulary, "circumstances maintaining the incident". You might have otherwise expected me to say "root cause", but that is not what I'm talking about.
Although an incident may have been triggered by one chain of events, there are often several factors that keep it going. Ideally, the operator identifies one of those factors to adjust in order to resolve the incident. This can often be done without understanding the original chain of events.
For instance, perhaps a new algorithm is non-performant, pushing server CPU utilization past a tipping point, and the site goes down. As an adjustment, the operator can provision more servers to distribute the load. This resuscitates the site, and the operator never had to understand the source of the increased CPU utilization.
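To make that tipping point concrete, here is a toy sketch in Python with made-up numbers (the request rate, per-request CPU cost, fleet size, and threshold are all hypothetical, not taken from a real incident). The point it illustrates is that adding servers brings utilization back under the threshold without the operator ever learning why the per-request cost rose.

    # Toy model: average fleet CPU utilization for a given server count.
    # All numbers are hypothetical.
    REQUESTS_PER_SECOND = 1200        # current traffic
    CPU_SECONDS_PER_REQUEST = 0.05    # cost after the non-performant algorithm shipped
    CORES_PER_SERVER = 16
    TIPPING_POINT = 0.80              # above this, the site falls over

    def utilization(num_servers: int) -> float:
        """Average CPU utilization across the fleet."""
        cpu_needed = REQUESTS_PER_SECOND * CPU_SECONDS_PER_REQUEST  # 60 cores' worth of work
        cpu_available = num_servers * CORES_PER_SERVER
        return cpu_needed / cpu_available

    # With 4 servers: 60/64 = 0.94 -> past the tipping point, the site is down.
    # With 6 servers: 60/96 = 0.62 -> two more servers resuscitate the site.
    for n in (4, 6):
        util = utilization(n)
        print(n, round(util, 2), "DOWN" if util > TIPPING_POINT else "ok")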
Although the operator doesn't need to understand the original chain of events, they do have to piece together why the incident persists. To do this, when an operator begins diagnosis, they should start by inspecting the symptom. They should incrementally inspect underlying causes until they produce enough of an understanding to make the right adjustment.
Here is another example.
Symptom: Website is not responding.
Diagnostic actions:
1. The website is not responding, so check the load balancer: it reports every application server as unhealthy.
2. Log in to one of the application servers: its logs show requests stuck waiting on the database.
3. Inspect the database: it has run out of available connections.
With that understanding, the operator can make an adjustment, such as raising the connection limit or restarting the clients hogging connections, and the site comes back.
The characteristic to take note of is how one breadcrumb leads to the next, in an 'A to B to C' sort of way. Staying on this trail makes the forward progress of the diagnosis easy to see.
An alternative to incremental diagnosis is pattern matching: scan a wide variety of diagnostic information in the hope that an answer jumps out. This is valid, and it can be a fast way to resolve simple incidents. But if the pattern matching doesn't succeed quickly, continuing to diagnose this way will feel aimless and endless. A good balance is to start with pattern matching and, if it doesn't quickly yield answers, switch to incremental diagnosis.
Finally, let's touch back on that use-case of alerts.
In previous essays, we discussed how you should avoid urgently alerting on underlying causes because they don't necessarily represent real production problems. However, you can use non-urgent alerts on underlying causes to aid in diagnosis. Put them on a dashboard instead of sending an SMS to your team. When an operator is debugging a symptom, they can check this dashboard to make their investigation simpler.
The following are the steps for using non-urgent alerts this way:
1. Define alerts on underlying causes, just as you might have been tempted to for urgent alerting.
2. Route them to a dashboard rather than to your team's SMS or pager.
3. When diagnosing a symptom, check the dashboard to see which underlying causes are currently firing, and treat those as breadcrumbs.
Instead of logging into machines and reading graphs, an operator can follow the trail of underlying cause alerts. Incidents that would have taken hours can be driven down to minutes if the right alerts are present.
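As a sketch of how this routing can be wired up (the alert names, router, and channels below are hypothetical, not a particular monitoring product's API): symptom alerts page a human, while underlying-cause alerts are only recorded on a dashboard for the operator to consult during diagnosis.

    # Hypothetical sketch: route urgent symptom alerts to the pager,
    # and non-urgent underlying-cause alerts to a dashboard.
    from dataclasses import dataclass

    @dataclass
    class Alert:
        name: str
        urgent: bool  # True for symptoms, False for underlying causes

    def page_oncall(alert: Alert) -> None:
        print(f"PAGE (SMS): {alert.name}")       # demands immediate attention

    def record_on_dashboard(alert: Alert) -> None:
        print(f"dashboard entry: {alert.name}")  # visible when an operator goes looking

    def route(alert: Alert) -> None:
        if alert.urgent:
            page_oncall(alert)
        else:
            record_on_dashboard(alert)

    for alert in [
        Alert("website_not_responding", urgent=True),        # symptom
        Alert("server_cpu_above_90_percent", urgent=False),  # underlying cause
        Alert("db_connections_near_limit", urgent=False),    # underlying cause
    ]:
        route(alert)

When the operator starts from the symptom, each firing dashboard entry is a breadcrumb they get for free.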
That's a good place to stop for this essay. In the next essay, we'll discuss how the adjustments we make to remediate incidents can themselves be risky and cause more incidents! Have a great day!
That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.