SRE 08: Verification - Alerting Fatigue

This is part 8 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed Protections, which you use to prevent scenarios from escalating into incidents.

This time, we dig into Verifications, which are mechanisms that bring timely awareness of incidents when they occur.

The primary form of verification is an Alert, which is a periodic system probe that checks for the presence of an incident. A simple implementation could be a cron job that runs a script; depending on the outcome of an if statement, the script might send you an SMS notification.
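
To make that concrete, here is a minimal sketch of such a script in Python. The health-check URL and the send_sms helper are placeholders for whatever probe and SMS provider you actually use:

    # alert_probe.py -- a minimal alert: probe the site, notify on failure.
    # Run it from cron, e.g. once a minute:
    #   * * * * * /usr/bin/python3 /opt/alerts/alert_probe.py
    import urllib.error
    import urllib.request

    HEALTH_URL = "https://example.com/health"  # placeholder health-check endpoint

    def site_is_healthy() -> bool:
        """Probe the site; treat any error, timeout, or non-200 as unhealthy."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
                return response.status == 200
        except (urllib.error.URLError, TimeoutError):
            return False

    def send_sms(message: str) -> None:
        """Placeholder; wire this up to your actual SMS provider."""
        print(f"SMS: {message}")

    if __name__ == "__main__":
        # The "if statement" that decides whether anyone gets notified.
        if not site_is_healthy():
            send_sms("ALERT: website is not responding")

Whatever the probe checks, the shape stays the same: probe, decide, notify.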

The implementation of an alert is quite simple, so we won't discuss it further. The real complexity involves answering these questions for yourself:

  1. What signals should I be alerting off of?
  2. With what level of urgency should a particular alert notify the team?

For the purposes of these two questions, a signal is an observable characteristic of your system during an incident. Urgency covers how the alert notifies your team member and how it prompts them to remediate the incident. Does the alert trigger a phone call and say, "fix immediately"? Or does it trigger an email that says, "fix within 5 days"?
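
One way to think about urgency is as data attached to an alert: the channel it uses and the remediation window it asks for. Here is a rough sketch along those lines; the names are made up for illustration:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Urgency:
        channel: str           # how the team member is notified
        remediate_within: str  # what the alert asks of them

    # The two ends of the spectrum from the example above.
    PHONE_CALL = Urgency(channel="phone call", remediate_within="immediately")
    EMAIL = Urgency(channel="email", remediate_within="5 days")

    # An alert definition then carries an urgency alongside the signal it watches.
    site_down_alert = {"signal": "website is not responding", "urgency": PHONE_CALL}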

Without properly answering these questions, before long you will experience Alert Fatigue. You'll have hundreds and hundreds of alerts sending SMS to your team, many of them perpetually doing so. The team will ignore them because they're irrelevant, not actionable, or simply overwhelming everyone's bandwidth. This leads to legitimate alerts slipping through the cracks, which is not much better than having no alerts at all.


Let's discuss signals and how they tie in with urgency.

When an incident occurs, a lot of aberrant dynamics appear in your systems. These dynamics produce signals that can be classified as symptoms, underlying causes, or correlations. Here is an example of three signals that might be observable simultaneously.

  • Symptom - Website is not responding
  • Underlying Cause - Database threads are all stalled waiting on row locks
  • Correlation - Application server CPU average is 80%, when it's usually 40%

A Symptom is an actual problem as experienced by a user. If users are experiencing a meaningful degradation of service, then there is a business-driven reason to respond urgently and remediate the incident.

An Underlying Cause is a characteristic that led to the symptom. It's not what the user is experiencing, but it points you towards how to remediate the incident. The same underlying causes do not necessarily lead to the same symptoms. Therefore, it is dangerous to put an alert on underlying causes. For example, you might encounter an outage caused by high database CPU utilization. In the future, you might encounter high database CPU utilization while the site is perfectly healthy. Alerting on high database CPU utilization yields an alert that is not always meaningful.

A Correlation is something else that tends to happen at the same time. Alerting on correlations has the same downsides as alerting on Underlying Causes, with an even higher degree of noisiness.
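
To ground these three categories, here is a sketch of check functions for the example signals above. The thresholds are placeholders; the point is what each function measures:

    def symptom_site_unresponsive(failed_probes: int, total_probes: int) -> bool:
        """Symptom: users' requests to the website are not being answered."""
        return total_probes > 0 and failed_probes / total_probes > 0.5

    def cause_db_threads_stalled(threads_waiting_on_locks: int, total_threads: int) -> bool:
        """Underlying cause: database threads are stalled waiting on row locks."""
        return total_threads > 0 and threads_waiting_on_locks / total_threads > 0.9

    def correlation_app_cpu_elevated(cpu_average: float, typical_average: float = 0.40) -> bool:
        """Correlation: app server CPU is well above its usual average."""
        return cpu_average >= 2 * typical_average  # e.g. 80% when 40% is typical

Only the first function measures something a user actually experiences; the other two measure internal state that may or may not accompany a user-visible problem.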

Alert Fatigue comes from irrelevant or unactionable alerts, which occur because underlying causes and correlations are being treated as symptoms.

For example, the site has an outage, the team notices high app server CPU during the outage, and so they add an alert on it. Later on, an operator gets paged at 3 am for high app server CPU, and they're frustrated because the site is perfectly healthy. This person learns to ignore the alert. Over time, as more of the team experiences incidents like these, they will pick up the habit of questioning the validity of the alerts they receive.

Don't get me wrong, alerts on underlying causes and correlations can be useful, especially if an alert acts as a high-signal leading indicator for a future symptom. They're also useful if you expose them to operators for use down the pipeline in remediation. These use cases will be discussed in later essays. The point for this essay is that these alerts should not be marked as urgent.

Symptoms should be treated as urgent. Underlying Causes and Correlations should be treated as less than urgent.
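
If it helps to see that policy as code, here is a sketch, with "page" and "ticket" standing in for whatever urgent and non-urgent channels your team actually uses:

    # Policy: only symptoms page a human; everything else files low-urgency work.
    URGENCY_BY_CLASSIFICATION = {
        "symptom": "page",             # phone call, fix immediately
        "underlying_cause": "ticket",  # email, fix within days
        "correlation": "ticket",       # email, fix within days
    }

    def urgency_for(classification: str) -> str:
        # Default to the quiet path; paging should be a deliberate choice.
        return URGENCY_BY_CLASSIFICATION.get(classification, "ticket")

    assert urgency_for("symptom") == "page"
    assert urgency_for("correlation") == "ticket"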

That's a good place to stop for this essay. In the next essay, we will discuss leading indicators, another kind of signal: a protection masquerading as a verification. Have a great day!

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.