SRE 04: SLOs - Formalized Measurable Goal Posts

This is part 4 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use the vocabulary and concepts defined in them.

Previously, we discussed goal posts for reliability, and that they should sit somewhere between being so unreliable that you lose customers and being so reliable that customers stop noticing.

The next step is to formalize your goal posts as metrics in SLOs, or Service Level Objectives. While the goal posts are a tool for modeling your intuition about how much reliability matters, your SLOs are a tool for keeping you and your team accountable for adhering to those goal posts.

An SLO is a metrics- and fact-driven yes/no question about whether your systems adhere to your goal posts over some predetermined time span. Ideally, you can automate computing the answer, but that's not necessary in all cases.

Let's flesh this concept out using examples, then return to the definition after that.

  • Latency/Minute - For each minute, 99% of user requests must have request latencies below 500ms.
  • Latency/Month - For each month, 90% of minutes must meet the Latency/Minute SLO.
  • Availability/Minute - For each minute, 99.9% of user requests must not yield errors.
  • Availability/Month - For each month, 95% of minutes must meet the Availability/Minute SLO.
  • Durability/Month - At most, 100 users per month can have their data enter a state where they would be forced to start over from scratch.
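To show how the minute and month SLOs above reduce to yes/no answers, here is a minimal sketch in Python. The `MinuteStats` type and the function names are made up for illustration, and treating a minute with no traffic as passing is my assumption, not a rule. Your own metrics pipeline will look different.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinuteStats:
    """Hypothetical per-minute summary of user requests."""
    latencies_ms: List[float]  # one latency per user request in this minute
    errors: int                # how many of those requests yielded errors


def latency_minute_ok(m: MinuteStats, threshold_ms: float = 500.0,
                      target: float = 0.99) -> bool:
    """Latency/Minute: did at least 99% of requests finish below 500ms?"""
    if not m.latencies_ms:
        return True  # assumption: a minute with no traffic passes
    fast = sum(1 for latency in m.latencies_ms if latency < threshold_ms)
    return fast / len(m.latencies_ms) >= target


def availability_minute_ok(m: MinuteStats, target: float = 0.999) -> bool:
    """Availability/Minute: did at least 99.9% of requests avoid errors?"""
    total = len(m.latencies_ms)
    if total == 0:
        return True  # assumption: a minute with no traffic passes
    return (total - m.errors) / total >= target


def month_ok(minute_results: List[bool], target: float) -> bool:
    """Month-level SLO: did enough minutes meet their minute-level SLO?"""
    if not minute_results:
        return True
    return sum(minute_results) / len(minute_results) >= target


# Usage: Latency/Month uses target=0.90, Availability/Month uses target=0.95.
minutes = [MinuteStats(latencies_ms=[120.0] * 2000, errors=1)] * 60
latency_month = month_ok([latency_minute_ok(m) for m in minutes], target=0.90)
availability_month = month_ok([availability_minute_ok(m) for m in minutes], target=0.95)
```

Notice that the month-level check never looks at raw requests. It only counts how many minutes passed, which is part of what keeps it easy to compute.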

You're looking for a yes/no target to direct your team toward, ideally (but not necessarily) one that your automation can measure.

When introducing a new SLO, avoid overthinking where to pin the numbers. Start near the table stakes. You can tune the numbers over time as you get accustomed to the process of adhering to them. Prefer SLOs that are easy to compute over SLOs that perfectly represent business costs.
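When you're deciding where to pin a monthly number, it can help to translate the percentage into the number of minutes that are allowed to fail. A rough sketch, assuming a 30-day month (the function name is just for illustration):

```python
def allowed_failing_minutes(days_in_month: int, target_percent: int) -> int:
    """Minutes per month that may miss the minute-level SLO."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - target_percent) // 100


print(allowed_failing_minutes(30, 90))  # 4320 minutes, i.e. 72 hours
print(allowed_failing_minutes(30, 95))  # 2160 minutes, i.e. 36 hours
```

Seeing a target as "36 hours of bad minutes per month" rather than "95%" tends to make it easier to judge whether the number is near your table stakes.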

Acknowledge the need to account not just for different reliability dimensions, but also for different time horizons. You use a shorter time horizon to keep your team accountable for resolving incidents promptly. You use a longer time horizon to keep your team accountable for reliability improvement projects. Let's make that concrete by discussing how you might implement Availability/Minute and Availability/Month SLOs.

  • Availability/Minute - You have a graph that exposes your minute-by-minute error rate. When a minute fails the SLO, an alert goes off, and a member of your team automatically gets a phone call. If the error rate does not improve, that team member gets phoned again every 5 minutes until they resolve the issue.

  • Availability/Month - Every month, a staff member aggregates the error rate graphs in preparation for a full team meeting. In the meeting, you discuss the availability numbers. If they fail the SLO, awareness of this fact is inescapable because of the meeting's focus. You discuss why this is happening, and you cut a task to ensure that somebody has the bandwidth to address the underlying issue. If the numbers remain below the SLO month over month for similar reasons, you discuss why the tasks are not improving the situation, then assign somebody to investigate more deeply what the earlier tasks missed.
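The two processes above could be sketched roughly as follows. This is only an illustration: `page_minutes` and `monthly_summary` are hypothetical names, the input is a list with one pass/fail value per minute, and a real setup would lean on your alerting and paging tools rather than hand-rolled code.

```python
from typing import Dict, Iterable, List, Optional


def page_minutes(minute_passed: Iterable[bool], repeat_every: int = 5) -> List[int]:
    """Return the minute indexes at which the team member would be phoned.

    A call goes out when the minute-level SLO first fails, then again every
    `repeat_every` minutes for as long as it keeps failing.
    """
    pages: List[int] = []
    failing_since: Optional[int] = None
    for minute, passed in enumerate(minute_passed):
        if passed:
            failing_since = None  # resolved, so stop phoning
            continue
        if failing_since is None:
            failing_since = minute
        if (minute - failing_since) % repeat_every == 0:
            pages.append(minute)
    return pages


def monthly_summary(minute_passed: List[bool], target: float = 0.95) -> Dict[str, object]:
    """Aggregate a month of minute results into numbers for the team meeting."""
    total = len(minute_passed)
    passing = sum(minute_passed)
    fraction = passing / total if total else 1.0
    return {
        "passing_minutes": passing,
        "total_minutes": total,
        "met_slo": fraction >= target,
    }


# Usage: one good minute, a 7-minute outage, a recovery, then a new failure.
timeline = [True] + [False] * 7 + [True] + [False]
print(page_minutes(timeline))  # [1, 6, 9]
```

The short horizon drives the phone calls, and the long horizon produces the yes/no answer that the monthly meeting revolves around.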

Consider what would happen to the team if the SLOs and the process for observing them did not exist. The team would remain ignorant of underlying issues and not resolve them. The power of SLOs is that they guarantee accountability for your objectives.

That's a good place to stop for this essay. In the next essay, we will shift towards discussing Protection, Verification, and Remediation, which are the core building blocks of improving reliability.

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.