Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.
Previously, we began discussing latency and how difficult diagnosis is without instrumentation.
This time, we'll expand on Latency by discussing Latency Creep, a phenomenon where systems get slower and slower over time without their teams being aware of it.
The life arc of Latency Creep is as follows -- a feature launches, its latency gets checked once and judged acceptable, and then months of unrelated changes each add a little more delay. Nobody is watching, so nobody notices until users are visibly hurting.
The problem is that the latency oversight happened once during launch, as opposed to periodically.
Remember how we could use alerts on leading indicators as protection against incidents? We can use that here.
When the feature is deployed, install an alert that triggers at a moderate latency threshold, somewhere below your SLO.
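To make that concrete, here's a minimal sketch of what such a check could look like. Everything in it is an assumption for illustration: the thresholds are invented, and fetch_p95_latency_ms and send_alert_email are hypothetical stand-ins for whatever your monitoring and alerting systems actually provide.

```python
SLO_MS = 1500      # illustrative SLO: the point at which users are clearly hurting
WARNING_MS = 1200  # the "moderate latency" threshold, deliberately below the SLO


def fetch_p95_latency_ms(page: str) -> float:
    """Placeholder: in real life, query your monitoring system for a recent p95."""
    return 1250.0  # canned value so the sketch runs end to end


def send_alert_email(subject: str, body: str) -> None:
    """Placeholder: in real life, route this through your team's alerting channel."""
    print(f"ALERT: {subject}\n{body}")


def check_latency_creep(page: str) -> None:
    """Compare the page's recent latency against the warning threshold and nudge the team."""
    latency_ms = fetch_p95_latency_ms(page)
    if latency_ms >= WARNING_MS:
        send_alert_email(
            subject=f"Latency creep on {page}",
            body=(
                f"p95 latency is {latency_ms:.0f}ms; the warning threshold is "
                f"{WARNING_MS}ms and the SLO is {SLO_MS}ms."
            ),
        )


if __name__ == "__main__":
    check_latency_creep("/my-web-page")  # hypothetical page name
```

The code itself isn't the point; what matters is that something like this runs on a schedule (a cron job, or a rule inside your monitoring system) so the latency review happens continuously instead of once at launch.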
Let's walk through what would happen with an example. Say you launch a web page -- at launch, its latency is comfortably low, well within the SLO.
Three months pass and it's at 800ms. Three more months and it's just above 1200ms. The team gets an email triggered by the alert, and they're now aware of the latency creep. It's not at the SLO, so they don't panic. They cut a task, and a few weeks later somebody reduces it to 900ms. Through this, they've found a way to get ahead of the latency creep without having to wait for it to bite them first.
Finally, let's briefly touch on which percentile to alert on. You don't need perfection here, but you do want to be shielded from outliers and from the over-reduction of information that averages yield. Thus, as a starting point, I'd recommend alerting on the 95th percentile. More precise math than that is beyond the scope of this essay.
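To see why, here's a small sketch using only Python's standard library and made-up numbers: a couple of pathological requests drag the mean well above what any typical user experienced, while the 95th percentile still tracks the slow end of normal traffic.

```python
import statistics

# 98 "typical" requests around half a second, plus two pathological outliers.
# All numbers are invented purely for illustration.
latencies_ms = [500 + i for i in range(98)] + [10_000, 12_000]

mean_ms = statistics.mean(latencies_ms)
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile

print(f"mean = {mean_ms:.0f}ms, p95 = {p95_ms:.0f}ms")
# mean ≈ 758ms -- higher than any request a typical user actually saw
# p95  ≈ 595ms -- reflects the slow end of normal traffic, ignores the two outliers
```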
That's a good place to stop for this essay. In the next essay, we will discuss availability and how to place defensive algorithms given the vast variety of potential underlying hardware incidents.
That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.