davidmah.com

SRE 09: Verification turned Protection: Leading Indicators

This is part 9 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed Alert Fatigue, and how it is caused by inappropriately treating underlying causes and correlations as urgent. I spent that post arguing that you should alert urgently only on symptoms.

Now, I'd like to point out another type of signal worth alerting on urgently: the leading indicator. A leading indicator is a signal that reflects an imminently approaching symptom. At that moment, no user is experiencing a problem. However, an incident will happen soon unless your team makes some adjustments to the system. Here's an example:

The database is at 80% memory utilization and it's creeping upwards.
Within an hour, it will hit 100%.
At that time, the site will go down.

For a scenario like this, it's worth having an alert that notifies the team well before memory utilization reaches 100%. There is no user-facing symptom yet, but the notification lets the team be proactive and prevent the incident before it occurs. Because all of this happens before an incident, it fits into the Incident Response Pipeline not as a Verification, but as a Protection.
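To make that concrete, here's a rough sketch in Python of how such an alert might decide when to fire. It assumes utilization grows roughly linearly and simply extrapolates from the oldest and newest samples. The function names and the 90-minute horizon are my own placeholders for illustration, not anything from a particular monitoring tool.

```python
from typing import Optional, Sequence, Tuple

# Each sample is (minutes_since_start, utilization_percent), oldest first.
Sample = Tuple[float, float]


def minutes_until_full(samples: Sequence[Sample], limit: float = 100.0) -> Optional[float]:
    """Estimate minutes until utilization reaches `limit`.

    Assumes linear growth between the first and last sample.
    Returns None when there is no upward trend to extrapolate.
    """
    if len(samples) < 2:
        return None
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (u1 - u0) / (t1 - t0)  # percentage points per minute
    if rate <= 0:
        return None  # flat or shrinking: nothing is approaching
    return max(0.0, (limit - u1) / rate)


def should_alert(samples: Sequence[Sample], horizon_minutes: float = 90.0) -> bool:
    """Fire when the projected time to exhaustion is inside the horizon.

    The 90-minute horizon is an assumed value; pick one that gives
    your team enough time to react.
    """
    remaining = minutes_until_full(samples)
    return remaining is not None and remaining <= horizon_minutes


# Memory went from 70% to 80% over 20 minutes, so about 40 minutes remain.
creeping = [(0.0, 70.0), (20.0, 80.0)]
print(minutes_until_full(creeping), should_alert(creeping))
```

A real system would likely want a sturdier trend estimate than two points, but the shape of the decision is the same: project forward, then compare against how much time the team needs.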

Leading indicators can involve a variety of timeframes. In the scenario above, the timeframe was about an hour.

But consider another example: database disk space is increasing by 1% every day. If you alert when disk utilization is at 90%, then you have 10 days before you hit 100% and the site stops working. Both are leading indicators, but how urgently you should respond to them is quite different because of the timeframe. Leading indicators should only be treated as urgent if the timeframe is short and the imminent symptom is a guarantee. Otherwise, you risk Alert Fatigue, just like with underlying causes and correlations.
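The arithmetic and the urgency rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the "page" versus "ticket" labels and the 4-hour cutoff are hypothetical, and whether a symptom is truly guaranteed is a judgment your team has to supply.

```python
from typing import Optional


def days_until_full(utilization_percent: float,
                    growth_percent_per_day: float,
                    limit: float = 100.0) -> Optional[float]:
    """Days until utilization reaches `limit` at a constant daily growth."""
    if growth_percent_per_day <= 0:
        return None  # not growing, so there is no deadline
    return max(0.0, (limit - utilization_percent) / growth_percent_per_day)


def route_alert(hours_remaining: float,
                symptom_is_certain: bool,
                urgent_within_hours: float = 4.0) -> str:
    """Treat a leading indicator as urgent only if it is both near and certain.

    Returns "page" for urgent alerts and "ticket" for everything else.
    The 4-hour cutoff is an assumed value, not a recommendation.
    """
    if symptom_is_certain and hours_remaining <= urgent_within_hours:
        return "page"
    return "ticket"


# Disk at 90%, growing 1% per day: 10 days of runway, so not urgent.
disk_days = days_until_full(90.0, 1.0)
print(disk_days, route_alert(disk_days * 24, symptom_is_certain=True))

# Memory about an hour from exhaustion: short and certain, so urgent.
print(route_alert(1.0, symptom_is_certain=True))
```

Note that a short timeframe alone isn't enough to page someone. If the symptom isn't certain, the sketch still routes it as non-urgent, which matches the rule above.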

That's a good place to stop for this essay. In the next essay, we will begin discussing Remediation by discussing how operators diagnose incidents. We'll show that alerts can be useful for that too. Have a great day!


That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.