davidmah.com

SRE 01: Site Reliability Outcomes

This is part 1 of
Site Reliability Engineering Birds Eye View

Welcome to my essay sequence "Site Reliability Engineering Birds Eye View". I hope you will find this useful.

This is an outside-in introduction to Site Reliability Engineering. We'll be focusing on principles and philosophy of Site Reliability, rather than specific technical scenarios. The goal is to provide you with a useful framework for seeing how technical scenarios fit into the big picture.

Let's jump in by anchoring ourselves to the objectives of Site Reliability. Everything we discuss flows back to these objectives, and anything that doesn't is not part of Site Reliability.

Your production systems can be measured against your users' tolerance along each of these essential reliability dimensions:

Latency - System request response time does not exceed a tolerable timespan.
Availability - Systems fail to respond to only a tolerable fraction of requests.
Durability - Systems enter irrecoverable situations for only a tolerable number of users with tolerably few recurrences.
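
As a rough illustration, the first two dimensions can be checked against tolerances using a batch of request records. The record shape, the thresholds, and the function names below are assumptions made for this sketch; real tolerances come from what your users will accept.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Request:
    latency_ms: float  # how long the system took to respond
    succeeded: bool    # False if the system failed to respond


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency over the given requests."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(latencies) / 100))
    return latencies[rank - 1]


def availability(requests):
    """Fraction of requests that received a response."""
    return sum(r.succeeded for r in requests) / len(requests)


# Hypothetical tolerances, chosen only for the example.
LATENCY_P99_TOLERANCE_MS = 300.0
AVAILABILITY_TOLERANCE = 0.999


def within_tolerance(requests):
    """Report, per dimension, whether the batch stays within tolerance."""
    return {
        "latency": latency_percentile(requests, 99) <= LATENCY_P99_TOLERANCE_MS,
        "availability": availability(requests) >= AVAILABILITY_TOLERANCE,
    }
```

Durability is left out on purpose: it concerns irrecoverable situations per user rather than per request, so it usually needs a different kind of record.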

If you have a deployed production system, the presence of reliability risks is absolute. Your work in Site Reliability is to systematically drive these risks downward by targeting the probability and magnitude of the incidents that affect these dimensions.

What is that process? For each dimension:

Define your goalposts and SLOs based on what is relevant for the business.
Model the incident response pipeline and institute automated or semi-automated mechanisms for each phase.
- Protection - To guard against scenarios that induce failures.
- Verification - To check, now and on an ongoing basis, that failures are not occurring.
- Remediation - To enable quick recovery when failures occur.
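
One way to keep this process honest is to write it down per dimension and check which phases still lack a mechanism. The sketch below is only an illustration: the `ReliabilityPlan` class, the SLO wording, and the mechanism names are all hypothetical.

```python
from dataclasses import dataclass, field

PHASES = ("protection", "verification", "remediation")


@dataclass
class ReliabilityPlan:
    dimension: str  # "latency", "availability", or "durability"
    slo: str        # goalpost agreed with the business
    mechanisms: dict = field(default_factory=dict)  # phase -> mechanism names

    def add(self, phase, mechanism):
        """Record a mechanism under one of the three phases."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.mechanisms.setdefault(phase, []).append(mechanism)

    def uncovered_phases(self):
        """Phases that have no mechanism instituted yet."""
        return [p for p in PHASES if not self.mechanisms.get(p)]


plan = ReliabilityPlan("availability", "99.9% of requests answered over 30 days")
plan.add("protection", "load shedding at the edge")
plan.add("verification", "alert on elevated error rate")
print(plan.uncovered_phases())  # ['remediation']
```

Here the plan for availability has protection and verification but nothing for remediation, which is exactly the kind of gap the process is meant to surface.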

By instituting systematic mechanisms for each of these phases, you create confidence in the ongoing quality of your production systems.

To emphasize how critical these mechanisms are, let's consider what their absence looks like, using an extreme example. Imagine deploying a major feature to production without any unit tests. That would be unacceptable and would bring you quite a bit of discomfort! It would be difficult to create lasting confidence in the algorithmic correctness of the code.

Whereas you create confidence in algorithmic correctness via unit testing, you create confidence in reliability via protections, verifications, and remediations.

To enhance your intuition about reliability, you must port this feeling about what is acceptable in algorithmic correctness over to latency, availability, and durability.

That's a good place to stop for this essay. In the next essay, we will discuss the incident response pipeline, which is the fundamental framework that you will use to model your efforts against each reliability dimension. Have a great day!

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.