SRE 13: Latency - Protections against Latency Creep via Leading Indicators

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we began discussing latency and how difficult it is to diagnose without instrumentation.

This time, we'll expand on Latency by discussing Latency Creep, a phenomenon where systems get slower and slower over time without their teams being aware of it.

The life arc of Latency Creep is as follows --

  1. A new feature is built
  2. The team deploys it
  3. They click around and test latency manually. It looks and feels great.
  4. Over time, the team makes small edits to the algorithms, the data set gets bigger, the servers become more heavily utilized, etc.
  5. Over the same time, latencies increase as those small edits are made.
  6. Eventually, the feature becomes an intolerably slow user experience, and the team is caught by surprise.

The problem is that latency was checked only once, at launch, rather than periodically.

Remember how we could use alerts on leading indicators to act as protections against incidents? We can use that here.

When the feature is deployed, install an alert triggered on moderate latency.

  • What is moderate?
    • High enough that slowness is noticeable, but still below the SLO.
  • How urgent?
    • Email, or some mechanism that gets attention without feeling urgent.
  • What sample?
    • The 95th percentile, aggregated over a minute or so of samples (a minimal sketch follows this list).
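
For concreteness, here is a minimal Python sketch of that check, under some assumptions: latency samples arrive in milliseconds, something already collects them into one-minute buckets, and a low-urgency notification hook exists. The notify callback and all names below are hypothetical, not a prescribed implementation --

    import statistics

    def p95(latencies_ms):
        """95th percentile of a bucket of latency samples, in milliseconds."""
        if len(latencies_ms) < 2:
            return latencies_ms[0] if latencies_ms else 0.0
        # statistics.quantiles with n=100 returns 99 cut points; index 94 is the 95th percentile.
        return statistics.quantiles(latencies_ms, n=100)[94]

    def check_latency_creep(latencies_last_minute_ms, moderate_threshold_ms, notify):
        """Send a non-urgent notification when the one-minute p95 crosses the
        moderate threshold -- above "feels fine", below the SLO."""
        observed = p95(latencies_last_minute_ms)
        if observed > moderate_threshold_ms:
            notify(f"p95 latency {observed:.0f}ms exceeded {moderate_threshold_ms}ms")
        return observed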

Let's walk through what would happen with an example. Say you launch a web page --

  • It responds at 500ms and feels reasonable.
  • Your SLO is 1500ms.
  • You set your leading indicator alert to fire if the 95th percentile in any particular minute exceeds 1200ms.

Three months pass and it's at 800ms. Three more months and it's just above 1200ms. The team gets an email triggered by the alert, and they're now aware of the latency creep. It's not at the SLO, so they don't panic. They cut a task, and a few weeks later somebody reduces it to 900ms. Through this, they've found a way to get ahead of the latency creep without having to wait for it to bite them first.
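
In code, reusing the hypothetical check_latency_creep sketch from above with the example's numbers (notify wired to print here rather than email), the launch-day traffic stays quiet while the crept traffic trips the non-urgent alert --

    import random

    random.seed(13)

    # Simulated one-minute buckets of latency samples, in milliseconds.
    samples_at_launch = [random.gauss(500, 50) for _ in range(600)]    # ~500ms, feels fine
    samples_after_creep = [random.gauss(1150, 80) for _ in range(600)] # p95 lands above 1200ms

    # 1500ms is the SLO; 1200ms is the moderate, leading-indicator threshold.
    check_latency_creep(samples_at_launch, moderate_threshold_ms=1200, notify=print)    # stays quiet
    check_latency_creep(samples_after_creep, moderate_threshold_ms=1200, notify=print)  # notifies the team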

Finally, let's briefly touch on the percentile. You don't need perfection here, but you want to be shielded from outliers and from the over-reduction of information that averages yield. Thus, as a starting point, I'd recommend alerting on the 95th percentile. Expanding beyond that with more precise math is beyond the scope I'd like to address here.
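
As a rough illustration of that trade-off, here's a comparison of the mean, the maximum, and the 95th percentile on the same made-up set of samples: a mostly fast page with a slow tail and one freak outlier --

    import statistics

    # 900 fast requests, 99 slow-tail requests, and one freak outlier (milliseconds).
    latencies = [400] * 900 + [2000] * 99 + [60000]

    print(statistics.mean(latencies))                  # 618ms   -- understates the slow-tail experience
    print(max(latencies))                              # 60000ms -- dominated by the single outlier
    print(statistics.quantiles(latencies, n=100)[94])  # 2000ms  -- shows the slow tail without the outlier dominating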

That's a good place to stop for this essay. In the next essay, we will discuss availability and how to place defensive algorithms given the vast variety of potential underlying hardware incidents.

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.