davidmah.com

SRE 14: Availability - Protections via Defensive Algorithms

This is part 14 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed latency creep and how latency slowly degrades if nobody is watching.

Now, we are moving to Availability, which can manifest quite differently from latency. Because of the variety and spontaneity of availability incidents, the need for Defensive Algorithms is unavoidable. The challenge is finding the right places to put them.

Let's work our way in with an example.

Site is healthy and happy
An application server runs out of disk space
It starts returning errors
Site is no longer healthy and happy

Preparing for this is hard because you can't easily predict the concrete circumstances. The example is just one of hundreds of potential ways to get similar errors: a CPU could overheat, a power cable could break, and the list goes on.

Certainly, you could list every possible issue and build a defense against each one: extra fans for your CPUs, secondary power lines, and so on. It's worth noting, though, that instead of trying to anticipate every specific scenario, it's generally easier to protect against a symptom that several scenarios share. The shared symptom in this case is that the server responds to requests with an error.

Because we are deploying a defensive algorithm, we will not try to stop the errors from happening in the first place. Instead, we will look at the client of this application server. Let's say that is the frontend JavaScript code running in your users' web browsers.

Instead of exposing the error to the user, we can change your frontend code's functionality to retry the request to a different application server. With this functionality, any time a single server starts acting up and returning errors, frontends will adjust their behavior to succeed in spite of those errors.
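To make the retry idea concrete, here is a minimal sketch of what such frontend logic might look like. Everything named here is an assumption for illustration: the function fetchWithFailover, the Fetcher type, and the idea that the frontend knows a list of application server URLs are all hypothetical and not taken from any particular library. It also assumes the request is safe to repeat, and it treats a 5xx status or a network failure as the signal to move on to the next server.

```typescript
// Hypothetical sketch: these names are illustrative, not a real API.
// A Fetcher is anything that takes a URL and resolves to a status code.
// In a browser it could wrap the built-in fetch; in a test it can be a fake.
type Fetcher = (url: string) => Promise<{ status: number }>;

// Try each application server in order. Return the first response that is
// not a server-side error, along with the server that produced it.
async function fetchWithFailover(
  servers: string[],
  path: string,
  fetcher: Fetcher,
): Promise<{ status: number; server: string }> {
  let lastError: unknown = new Error("no servers configured");

  for (const server of servers) {
    try {
      const response = await fetcher(server + path);
      if (response.status < 500) {
        // Success, or a client-side error that another server would not fix.
        return { status: response.status, server };
      }
      // 5xx: this server is misbehaving, so remember why and try the next.
      lastError = new Error(`${server} responded with status ${response.status}`);
    } catch (err) {
      // Network-level failure (unreachable host, reset connection, ...).
      lastError = err;
    }
  }

  // Every server failed; surface the most recent failure to the caller.
  throw lastError;
}

// Possible browser usage (server URLs are placeholders):
//   fetchWithFailover(
//     ["https://app1.example.com", "https://app2.example.com"],
//     "/api/profile",
//     (url) => fetch(url),
//   );
```

Notice that the sketch never asks why a server failed. Disk full, overheated CPU, or broken power cable all look the same from here, which is exactly what lets one small piece of client logic cover so many scenarios.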

This one change, adding retries at the point where many different failures look the same, does not solve every availability issue. It does, however, handle a huge variety of single-machine hardware incidents without our needing to go into detail about any of them.

That's a good place to stop for this essay. In the next essay, we will continue discussing availability through cascading incidents, where one tipping point leads to another, and now you have multiple incidents to remediate. Have a great day!

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.