SRE 03: Business Anchored Goal Posts

This is part 3 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed the Incident Response Pipeline as a way to understand the business value of reliability.

This time, we continue to contextualize reliability by identifying your goal posts. If your systems are stable right now, should you invest in reliability? How much should you invest in reliability?

The most critical input to how much you invest in reliability, and to how important it is relative to other work, is how much a gap in reliability costs your business. Recall the incident response pipeline equation -- if the business cost of a reliability incident is 0, then the value of reliability is also 0. Hence, intelligent prioritization of reliability can only be discussed within the context of your business.

Without the business context, if you were to ask something specific like, "should we have database backups?" I could not give you a reasoned answer. Sure, backups would not hurt, but working on them may not be worth the opportunity cost of the other projects your team could be doing.

Thus, we'll start with the business context. It isn't practical to condense your reliability situation into a single recurring cash cost number, so we won't try. Instead, we will bound the business context between two goal posts: a lower bound and an upper bound.


For your lower bound goal posts -- what level of unreliability would yield significant reductions in user and customer engagement?

  • Latency - How delayed do your system responses need to be for customers to end their contracts?
  • Availability - How long and how often does your system need to be down before customers are frustrated enough to seek a replacement?
  • Durability - If your users lose data, would they lose faith in your system?

Reliability worse than your lower bound will likely lead to the destruction of your business.


For your upper bound goal posts -- beyond what level of reliability would further improvement make no difference to the actions of your users and customers?

  • Latency - How delayed do your system responses need to be for customers to notice?
  • Availability - How long and how often does your system need to be down before customers take note?
  • Durability - If your users lose data, will they care? For which features does this apply?

Reliability better than these upper bound goal posts brings only trivial benefit to your business. At that point, staff should be moved to other projects.


Your actual goal post sits somewhere between your lower bound and upper bound goal posts.

Let's follow the slide from unreliable to reliable --

Below the lower bound, the system is so unreliable that the business as a whole faces imminent failure. It needs to be fixed immediately.

As you improve beyond the lower bound, you accelerate the business by improving customer experiences and reclaiming time that your staff otherwise spends remediating systems.

As you slide towards the upper bound, you encounter diminishing returns, where similarly complex reliability projects bring smaller and smaller business gains.

Above the upper bound, the system is so reliable that further improvements yield no business benefit. Hardly any company is here! And if a company is here, then by this definition it should deprioritize reliability so that its staff can work on something more important.
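If it helps to see the slide as code, here is a minimal sketch of it. Everything in it is an assumption made up for illustration: the `GoalPosts` and `Zone` names, the metrics, and especially the numbers, which you would replace with your own rough estimates. It also collapses the two middle stages (accelerating the business and diminishing returns) into one zone, since where one fades into the other is a judgment call.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    BELOW_LOWER_BOUND = "business at risk: fix immediately"
    BETWEEN_BOUNDS = "invest, weighing returns against other projects"
    ABOVE_UPPER_BOUND = "no business benefit: move staff elsewhere"


@dataclass(frozen=True)
class GoalPosts:
    """Hypothetical lower and upper bound goal posts for one metric."""
    name: str
    lower_bound: float  # worse than this and customers start leaving
    upper_bound: float  # better than this and customers do not notice
    higher_is_better: bool = True  # False for metrics such as latency

    def zone(self, measured: float) -> Zone:
        # Flip the sign for "smaller is better" metrics so that a larger
        # value always means "more reliable" in the comparisons below.
        sign = 1 if self.higher_is_better else -1
        value = sign * measured
        if value < sign * self.lower_bound:
            return Zone.BELOW_LOWER_BOUND
        if value > sign * self.upper_bound:
            return Zone.ABOVE_UPPER_BOUND
        return Zone.BETWEEN_BOUNDS


# Invented example numbers, not recommendations.
availability = GoalPosts("availability_pct", lower_bound=99.0, upper_bound=99.99)
latency = GoalPosts("p95_latency_ms", lower_bound=2000, upper_bound=100,
                    higher_is_better=False)

for posts, measured in [(availability, 98.5), (availability, 99.5), (latency, 50)]:
    print(f"{posts.name} = {measured}: {posts.zone(measured).value}")
```

The point of the sketch is only the shape of the decision: two rough thresholds per metric are enough to tell you which of the zones you are in.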

Your lower bound and upper bound goal posts are not meant to be mathematically precise. Defining them offers you a way to refine your intuition of when/how reliability is actually important given your business.

That's a good place to stop for this essay. In the next essay, we will discuss formalizing these goal posts into SLOs -- metrics that your team uses to hold itself accountable to the goal posts. Have a great day!


That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.