SRE 11: Remediation - Adjustments Themselves Induce Risky Scenarios

This is part 11 of Site Reliability Engineering Birds Eye View.
Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed diagnosis: how an operator comes to understand enough of the incident to plan an adjustment that resolves it.

This time, we will discuss Adjustments.

I'd like to highlight my vocabulary here. Instead of saying "Adjustment", I could have said something like "a fix". The word choice is deliberate: the adjustment deployed might not actually fix the incident. In fact, it might even make things worse.

An unfortunate reality is that a great number of incidents are caused by an adjustment made in response to a prior incident. It is very common to create a cascade of new problems while trying to fix existing ones.

Imagine this example of an incident response chat thread:

[11:04] Mary: I just got paged, the site is responding very slowly
[11:05] Mary: I think there is a memory leak in the application servers
              and that is causing the high latencies
[11:06] Jim: Let's band-aid it by restarting all of the processes
[11:07] Mary: Done. I ran a restart everywhere. Let's see if it recovers
[11:09] Mary: It seems to not have helped?
[11:10] Mary: Hmm.. now the site isn't even responding
[11:11] Jim: Oh jeez, the database is slammed
[11:11] Jim: At start time, a feature loads a bunch of data from
             the database and now they're all happening
             simultaneously because of the restarts
[11:12] Jim: Sigh, sorry for suggesting that. Let's figure out
             how to recover the database.

This is an example where an adjustment actually causes a second incident. It might seem unlikely, contrived even. Probably just a one-off. And how could Jim and Mary have known? That's exactly how it feels to teams when it happens to them. However, that's not the correct interpretation. Labeling these occurrences as "unlikely" is the start of a downward spiral towards systemically producing sustained outages for your business.

Your rate of incidents is a function of system complexity, which you can visualize as the number of interdependent moving parts. The more interdependent moving parts there are, the higher the chance of unexpected results. In this example, the adjustment itself influenced many moving parts simultaneously. Thus, a second incident was actually quite probable!
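
To make the simultaneity concrete, here is a rough simulation. None of it comes from the real incident: the "expensive startup query" is a stand-in (a sleep) for the feature in the chat log that loads a bunch of data from the database at start time, and the fleet size and timings are made up.

    import threading
    import time

    CONCURRENT = 0          # startup queries currently in flight against the "database"
    PEAK = 0                # highest concurrency observed
    LOCK = threading.Lock()

    def expensive_startup_query():
        """Stand-in for the start-time data load; the sleep simulates the query's runtime."""
        global CONCURRENT, PEAK
        with LOCK:
            CONCURRENT += 1
            PEAK = max(PEAK, CONCURRENT)
        time.sleep(0.5)     # pretend the database does half a second of real work
        with LOCK:
            CONCURRENT -= 1

    def restart_fleet(num_servers, stagger_seconds):
        """Restart every server; stagger_seconds controls how spread out the restarts are."""
        global CONCURRENT, PEAK
        CONCURRENT, PEAK = 0, 0
        threads = []
        for _ in range(num_servers):
            t = threading.Thread(target=expensive_startup_query)
            t.start()
            threads.append(t)
            time.sleep(stagger_seconds)
        for t in threads:
            t.join()
        return PEAK

    print("peak concurrent startup queries, restart everywhere at once:", restart_fleet(50, 0.0))
    print("peak concurrent startup queries, staggered by 100ms:", restart_fleet(50, 0.1))

Restarted all at once, fifty servers hand the database fifty copies of the startup load in the same instant; staggered even slightly, the peak stays in the single digits. That gap is the entire second incident.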

An operator running commands directly in production is akin to deploying major system changes without any testing first. That can yield bugs, and since the system is live, those bugs immediately become incidents.

During an incident, on an emotional level, operators feel rushed to fix the problem. Thus, on a cognitive level, their attention becomes myopic. They fail to comprehend the nuances of the adjustments they are making. Instead of simply urging them to make sure that adjustments are safe, urge them to lean towards making small adjustments.
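
One way to make "small adjustments" actionable is to script the change so it proceeds in small, observable steps rather than everywhere at once. The sketch below is only an illustration under assumed tooling: the host names, restart_host(), and site_is_healthy() are hypothetical placeholders for whatever restart mechanism and monitoring your environment actually provides.

    import time

    HOSTS = ["app-01", "app-02", "app-03", "app-04", "app-05", "app-06"]  # hypothetical fleet

    def restart_host(host):
        # Placeholder: swap in your real restart mechanism (ssh, an orchestration API, ...).
        print(f"restarting {host}")

    def site_is_healthy():
        # Placeholder: consult whatever latency/error monitoring you actually trust.
        return True

    def rolling_restart(hosts, batch_size=1, settle_seconds=60):
        """Restart a few hosts at a time, let the system settle, and stop early if it degrades."""
        for i in range(0, len(hosts), batch_size):
            batch = hosts[i:i + batch_size]
            for host in batch:
                restart_host(host)
            time.sleep(settle_seconds)   # give caches and startup queries time to finish
            if not site_is_healthy():
                raise RuntimeError(f"health degraded after restarting {batch}; stopping")

    rolling_restart(HOSTS, batch_size=2, settle_seconds=1)  # tiny settle time just for the demo

The specific script doesn't matter. What matters is the property it enforces: each step is small enough to observe and cheap to stop, which is exactly what "restart everywhere" lacked.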

That's a good place to stop for this essay. In the next essay, we'll begin applying the incident response building blocks to the reliability dimensions, starting with latency. Have a great day!


That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.