Welcome to my essay sequence "Site Reliability Engineering Birds Eye View". I hope you will find this useful.
This is an outside-in introduction to Site Reliability Engineering. We'll be focusing on principles and philosophy of Site Reliability, rather than specific technical scenarios. The goal is to provide you with a useful framework for seeing how technical scenarios fit into the big picture.
Let's jump in by anchoring ourselves to the objectives of Site Reliability. Everything we discuss flows back into these objectives, and anything that doesn't is not part of Site Reliability.
Your production systems can be measured against your users' tolerance along each of the essential reliability dimensions: latency, availability, and durability.
If you have a deployed production system, the presence of reliability risks is absolute. Your work in Site Reliability is to systematically drive these risks downward by targeting the probability and magnitude of the incidents that affect these dimensions.
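One way to make "probability and magnitude" concrete is to multiply the two and compare risks by their expected impact. The sketch below is only an illustration: the `IncidentRisk` type, the risk names, and every number in it are hypothetical, and in practice your estimates will be much rougher than this.

```python
from dataclasses import dataclass


@dataclass
class IncidentRisk:
    name: str
    incidents_per_year: float    # hypothetical estimate of how often this happens
    minutes_per_incident: float  # hypothetical estimate of user-facing impact

    def expected_impact(self) -> float:
        # Expected user-facing minutes per year: probability times magnitude.
        return self.incidents_per_year * self.minutes_per_incident


def rank_by_expected_impact(risks):
    # Largest expected impact first, so the biggest risk is targeted first.
    return sorted(risks, key=lambda r: r.expected_impact(), reverse=True)


# Made-up numbers, purely for illustration.
risks = [
    IncidentRisk("database failover", 2.0, 45.0),
    IncidentRisk("bad deploy", 12.0, 20.0),
    IncidentRisk("data-center outage", 0.1, 600.0),
]

for risk in rank_by_expected_impact(risks):
    print(f"{risk.name}: {risk.expected_impact():.0f} minutes/year")
```

You can drive a risk downward by lowering either factor: make the incident less likely, or make each occurrence smaller or shorter.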
What is that process? For each dimension, you put protections, verifications, and remediations in place.
By instituting systematic mechanisms in each of these phases, you create confidence in the ongoing quality of your production systems.
To emphasize how critical these mechanisms are, let's look at an extreme example of what their absence feels like. Imagine deploying a major feature to production without any unit tests. That would be unacceptable, and it would bring you quite a bit of discomfort! It would be difficult to create lasting confidence in the algorithmic correctness of the code.
Whereas you create confidence in algorithmic correctness via unit testing, you create confidence in reliability via protections, verifications, and remediations.
To enhance your intuition about reliability, you must port this feeling about what is acceptable in algorithmic correctness over to latency, availability, and durability.
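To make the analogy concrete, here is a minimal sketch of a latency verification written in the same spirit as a unit test. Everything in it is an assumption for illustration: the function names `percentile` and `verify_latency`, the choice of the 99th percentile, the nearest-rank method, and the sample data. A real verification would run against production measurements and a tolerance derived from your users.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def verify_latency(samples_ms, tolerance_ms, pct=99):
    """Return True if the chosen latency percentile is within the tolerance."""
    return percentile(samples_ms, pct) <= tolerance_ms


# Hypothetical request latencies in milliseconds.
samples = [120, 135, 110, 480, 125, 130, 140, 115, 128, 132]

# Like a unit test, the check either passes or fails against a stated expectation.
assert verify_latency(samples, tolerance_ms=500)
assert not verify_latency(samples, tolerance_ms=300)
```

Just as a failing unit test blocks a change to the code, a failing check like this one signals that the system has drifted outside what your users will tolerate.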
That's a good place to stop for this essay. In the next essay, we will discuss the incident response pipeline, which is the fundamental framework that you will use to model your efforts against each reliability dimension. Have a great day!
That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.