davidmah.com

SRE 07: Protection - Defensive Algorithms

This is part 7 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed Guard Rails, a form of Protection.

This time we discuss an alternate form of Protection — Defensive Algorithms.

Let's start by painting a metaphor using automobile travel. There is driving, and there is "defensive driving". If you're driving defensively then you're on the lookout for dangers in the road, and when danger comes to your attention you adjust your driving to avoid it. This approach enables you to drive more safely despite the dangers of the road.

Defensive algorithms have a similar characteristic. They are written with the expectation that errors will happen, and when they do, the algorithms fall back on alternate behavior that allows success in spite of those errors.

Let's make this concrete using a technical example. Consider this potential error:

Your app runs on your user's mobile phone, and every 60 seconds
it requests some information from your servers. What happens if
the user's internet access briefly stalls right at the moment
the request is made, yielding an error?

Now consider these two ways of dealing with it:

The error propagates through the app, which stops working properly.
The user's experience of the product is degraded.

The code that makes the request waits 3 seconds before trying again
and succeeding.
The user witnesses a hiccup, but the experience overall is fine.

In the first case, an error arose and led to a much bigger error, with business consequences. In the second case, the same error arose, but the experience did not degrade because of the alternate behavior used by the defensive algorithm.
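Here is a minimal sketch of the second case in Python. Treat it as an illustration rather than a definitive implementation: the name `fetch_with_retry`, its parameters, and the choice to treat `ConnectionError` as the "internet briefly stalled" error are all assumptions made for this example. The 3 second wait comes from the scenario above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 1,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); if the connection fails, wait and try again.

    `fetch` is a hypothetical stand-in for whatever function makes the
    request to your servers. `sleep` is injectable so that tests do not
    have to actually wait.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except ConnectionError:
            if attempt >= retries:
                # Out of retries: let the error propagate to the caller.
                raise
            attempt += 1
            # The defensive part: pause, then loop around and try again.
            sleep(delay_seconds)
```

With the defaults, one stalled request costs the user a 3 second hiccup instead of a broken app. If every attempt fails, the error still propagates, so the caller needs to decide what to do in that case.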

It's not ideal that errors like this can happen, but some errors are out of your control or would take too much effort to fix. In those situations, defensive algorithms let you work around the errors. When you encounter an error that is hard to fix at its root, ask yourself, "Instead of fixing this error, could I more easily mitigate it by adjusting our code to work around it?" Your defensive algorithms cannot catch everything, but they can reduce both the likelihood that errors propagate up and the magnitude of their effect on your user experience.

Finally, let's distinguish between defensive algorithms and guard rails. Whereas a guard rail only kicks in at the onset of an incident, defensive algorithms run as part of the normal course of operations. They are both protections, but they come into play at different times. Which protection is appropriate to deploy depends on the situation you are dealing with.

Start with this heuristic:

Defensive Algorithms - When you depend on subsystems that are prone to failures, lean towards mitigating the risk using Defensive Algorithms.
Guard Rails - When a risk is an exceptional circumstance, lean towards mitigating the risk using Guard Rails.

That's a good place to stop for this essay. In the next essay, we move to Verifications, and how they're more nuanced than just true/false probes. Have a great day!

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.