SRE 06: Protection - Guard Rails. Prefer Mechanical, not Human.

This is part 6 of
Site Reliability Engineering Birds Eye View

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed what makes up the Incident Response Building Blocks: Protection, Verification, and Remediation.

Now, we're going further into details, starting with Protections. Specifically, let's talk about Guard Rails, which are a form of protection.

Consider this example of a risk:

Non-performant database table migrations can lock the table
for extended periods of time,
causing it to no longer respond to any queries.

Consider two guard rail mechanisms that could protect against it.

  1. A pre-commit lint evaluates table migration code, detecting whether it would require a lengthy table lock. The commit is prevented if so.

Or

  2. Everybody on the team is taught about this risk. Code reviewers pay close attention to table migrations and discuss them to make sure no non-performant migrations are committed.

These are both guard rails. The first is mechanical and the other is human.
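
To make the mechanical option concrete, here is a minimal sketch of what such a lint could look like. It is not a real tool. The rule list is a hypothetical, PostgreSQL-flavored assumption, and which statements actually hold long locks depends on your database engine and version. The names `RISKY_PATTERNS`, `find_risky_statements`, and `lint_files` are made up for illustration.

```python
import re

# Hypothetical rules, assumed for illustration only. Which statements take
# lengthy locks depends on the database engine and version.
RISKY_PATTERNS = [
    (re.compile(r"\bALTER\s+TABLE\b.*\bALTER\s+COLUMN\b.*\bTYPE\b", re.I | re.S),
     "changing a column type may rewrite the whole table"),
    (re.compile(r"\bCREATE\s+(UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY)", re.I),
     "index built without CONCURRENTLY may block writes"),
]


def find_risky_statements(sql):
    """Return (statement, reason) pairs for statements that match a rule."""
    findings = []
    # Naive split on ';'. A real lint would use a proper SQL parser.
    for statement in sql.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        for pattern, reason in RISKY_PATTERNS:
            if pattern.search(statement):
                findings.append((statement, reason))
    return findings


def lint_files(paths):
    """Print findings for each migration file; return 1 if any were found."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for statement, reason in find_risky_statements(f.read()):
                failed = True
                print(f"{path}: {reason}: {statement}")
    return 1 if failed else 0
```

A pre-commit hook could pass the staged migration files to `lint_files` and exit with its return value, since a non-zero exit status from the hook is what stops the commit.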

Between human and mechanical, which is better? The mechanical guard rail is preferable, primarily because a risky change is guaranteed not to slip through, which is far from the case for a human guard rail.

Often, it seems easier to create a human guard rail: just make a new process and tell everyone on the team to adhere to it. That ease is a facade. You're now asking a lot from every one of your team members: they must learn the risk and adopt a new process, without reward, that slows down what they consider to be their actual job. Processes like these are exactly the type of thing that makes working at big companies feel so slow, and their absence is part of the advantage startups wield. To preserve this advantage, it's important to hesitate before adding such processes.

While implementing a mechanical guard rail may take longer, the task only involves one engineer. Plus, once they're done, even that engineer can let themselves forget about it. It is a chunky one-time cost with a minimal recurring cost.

Of course, the exception is when implementing the mechanical guard rail involves complex and extensive engineering. In that case, the human guard rail really is easier; just make sure you recognize the full cost of deploying it.

Finally, don't forget that guard rails are not your only option: you could improve reliability through different mechanisms instead. If the path forward with a guard rail is costly, you can back out and attack the pipeline somewhere completely different.

That's a good place to stop for this essay. In the next essay, we will discuss protection through defensive algorithms. Have a great day!

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.