SRE 16: Durability - Protections via Redundancy

This is part 16 of Site Reliability Engineering Birds Eye View.

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed cascading availability incidents, which are horrifying because of how hard they are to remediate and how long they last.

This time, we'll move into Durability, which can be even more horrifying because of the possibility of spontaneously ending your business.

A durability incident is when you irrevocably lose user data, and the affected users can't use your product anymore. What truly makes it horrifying is that you've broken the promises you made to your users, and they lose trust in your business's ability to keep any promise at all.

A simple example: imagine you run a service that stores photos for families. Families start using your photo service to document decades of their family heritage. One day, the hard drive behind one of your databases fails, as all drives eventually do. Because you have no backups, the data is unrecoverable.

What are you going to tell your users? That losing their heritage was just an accident, and that you didn't expect it to happen? Do you think revealing your good intentions heals the consequences of your failure? Is there anything you could say or do that would make it up to your users? After all of this, how could your users come to trust you ever again?

You're backed into a corner, and once you're there, no good option is left. That's what makes it so horrifying. The only real answer is to never put the company in this situation in the first place, and the way to do that is protection through redundancy.

Store each piece of user data in multiple isolated environments. That could be different hard drives in different buildings, accessed over different networks. If you have a data loss event, you recover by re-cloning from the copies that survived.
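To make that concrete, here is a minimal sketch of the write-everywhere, re-clone-on-loss idea. Everything in it is an assumption for illustration: `Store` is a hypothetical in-memory stand-in for one isolated environment, and `put_everywhere` and `reclone` are names I made up. A real system would talk to separate drives, buildings, and networks rather than Python dictionaries.

```python
import hashlib


class Store:
    """Hypothetical stand-in for one isolated storage environment."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}  # key -> bytes


def put_everywhere(stores, key, data):
    """Write the same blob to every isolated store."""
    for store in stores:
        store.blobs[key] = data


def reclone(stores, key, expected_sha256):
    """Repair stores that lost or corrupted `key` from a surviving good copy.

    Returns the names of the stores that were repaired.
    Raises LookupError if no store holds a copy matching the checksum.
    """

    def is_good(store):
        data = store.blobs.get(key)
        return data is not None and hashlib.sha256(data).hexdigest() == expected_sha256

    good = [s for s in stores if is_good(s)]
    if not good:
        raise LookupError(f"no intact copy of {key!r} survives")

    source = good[0].blobs[key]
    repaired = []
    for store in stores:
        if not is_good(store):
            store.blobs[key] = source
            repaired.append(store.name)
    return repaired


# Usage: three isolated copies, one of which is lost and then re-cloned.
stores = [Store("building-a"), Store("building-b"), Store("building-c")]
photo = b"family photo bytes"
digest = hashlib.sha256(photo).hexdigest()

put_everywhere(stores, "photo-1", photo)
del stores[0].blobs["photo-1"]  # simulate a drive failure in building-a
repaired = reclone(stores, "photo-1", digest)  # ['building-a']
```

The checksum is there so that a corrupted copy is never used as the source for the repair. Note that the sketch only recovers if at least one intact copy survives, which is exactly why the copies have to be isolated from each other.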

No single data loss event should be able to reach every copy of any piece of data. A replica database that has traffic forwarded from its master database is insufficient on its own, because a 'delete' query will flow through the master and hit the replica too. You need isolation across several dimensions, even if that means using different kinds of redundancy. You can use both replicas and backups, and each will serve for recovery in different scenarios.
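Here is a toy simulation of why a replica and a backup cover different failures. It is a sketch under assumptions, not how any particular database works: `Primary`, `apply_write`, and `take_backup` are hypothetical names, the replica is just a dictionary that receives every forwarded write, and the backup is a detached point-in-time copy.

```python
import copy


def apply_write(rows, op, key, value=None):
    """Apply a single write operation to a dictionary of rows."""
    if op == "put":
        rows[key] = value
    elif op == "delete":
        rows.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


class Primary:
    """Hypothetical master database that forwards every write to its replicas."""

    def __init__(self):
        self.rows = {}
        self.replicas = []

    def apply(self, op, key, value=None):
        apply_write(self.rows, op, key, value)
        for replica in self.replicas:
            # The replica faithfully repeats the write, including deletes.
            apply_write(replica, op, key, value)


def take_backup(primary):
    """Point-in-time copy that later writes cannot reach."""
    return copy.deepcopy(primary.rows)


primary = Primary()
replica = {}
primary.replicas.append(replica)

primary.apply("put", "photo-1", "bytes-1")
backup = take_backup(primary)

# Scenario 1: an accidental delete reaches the replica but not the backup.
primary.apply("delete", "photo-1")
deleted_on_replica = "photo-1" not in replica
survives_in_backup = "photo-1" in backup

# Scenario 2: data written after the backup exists only on the replica.
primary.apply("put", "photo-2", "bytes-2")
fresh_on_replica = "photo-2" in replica
missing_from_backup = "photo-2" not in backup
```

In the first scenario only the backup can restore `photo-1`. In the second, if the master's drive died right now, only the replica would still hold `photo-2`. Neither copy covers both cases by itself.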

That's a good place to stop for this essay. In the next essay, we will discuss how flimsy data loss recovery workflows tend to be, and thus how they should be considered useless unless tested. Have a great day!

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.