SRE 17: Durability - The Need to Test Recovery Workflows

This is part 17 of Site Reliability Engineering Birds Eye View

Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do because we'll continue to use vocabulary and concepts defined in them.

Previously, we discussed how horrifying durability incidents can be and how to protect against them using data redundancy.

This time, I'll suggest that data recovery methods really need to be tested periodically, or else they may not work when you need them the most.

Let's say that you are concerned about a durability incident, so you begin taking periodic database backups.

It's great to have the backups, but how do you know that you can properly use a backup to complete a recovery? Unless you are periodically testing the recovery, you should not have confidence that data recovery would succeed.

Data recovery logic tends to have quite a lot of edge cases.

  • How do you know the backup actually contains every piece of data it should?
  • What about all of the new data that was written after the backup was generated but before the database failure?
  • How do you know how long this recovery process is going to take?
  • Will your users tolerate how long the recovery takes, or will the downtime still crush the business?
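
To make those questions concrete, here is a minimal sketch of a recovery drill. It assumes a toy SQLite database standing in for production; the `orders` table, the function names, and the 60 second time budget are all hypothetical, chosen only for illustration. The drill restores a backup into a scratch database, counts what came back, measures how much data was written after the backup, and times the restore.

```python
import sqlite3
import time


def make_live_db() -> sqlite3.Connection:
    """Create a tiny in-memory 'production' database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    db.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",), ("c",)])
    db.commit()
    return db


def take_backup(live: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the live database into a separate backup database."""
    backup = sqlite3.connect(":memory:")
    live.backup(backup)  # sqlite3's built-in online backup
    return backup


def count_orders(db: sqlite3.Connection) -> int:
    return db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


def restore_drill(live, backup, max_seconds: float) -> dict:
    """Restore the backup into a scratch database and report what we learned."""
    started = time.monotonic()
    restored = sqlite3.connect(":memory:")
    backup.backup(restored)  # the actual "recovery" step being tested
    elapsed = time.monotonic() - started

    restored_rows = count_orders(restored)
    return {
        "restored_rows": restored_rows,
        # Data written after the backup was taken would be lost in a real incident.
        "rows_missing_since_backup": count_orders(live) - restored_rows,
        "elapsed_seconds": elapsed,
        "within_time_budget": elapsed <= max_seconds,
    }


live = make_live_db()
backup = take_backup(live)

# Simulate a write that lands after the backup but before the failure.
live.execute("INSERT INTO orders (item) VALUES ('d')")
live.commit()

report = restore_drill(live, backup, max_seconds=60.0)
print(report)
```

In a drill the live database still exists, so you can compare against it. In a real incident it won't, which is exactly why these numbers are worth collecting ahead of time.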

During a data loss incident, it's a sad feeling to realize that your recovery workflow doesn't work. When the incident happens, you feel confident because you have redundancy, and then you feel crushed when you realize you actually don't have redundancy.

It's also quite embarrassing. So periodically test your recovery workflows! These workflows do not need to be fully automated; they just need to be tested regularly.
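
"Regularly" is easy to let slip, so one low-effort option is to record when the last drill succeeded and flag it once it gets stale. The sketch below is only an illustration: the function name, the example dates, and the 90 day threshold are assumptions, not a recommendation for any particular cadence.

```python
from datetime import datetime, timedelta


def drill_is_overdue(last_drill: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the last successful recovery drill is older than max_age."""
    return now - last_drill > max_age


# Hypothetical dates: the last drill ran in mid-January, today is the end of May.
last_drill = datetime(2020, 1, 15)
today = datetime(2020, 5, 31)

overdue = drill_is_overdue(last_drill, today, max_age=timedelta(days=90))
print("Recovery drill overdue:", overdue)
```

A check like this can sit in whatever reminder or alerting system you already use. The drill itself can stay manual.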

That's a good place to stop for this essay. The tour through the reliability dimensions is complete. In the final essay, we will tie it together again and discuss staffing. Have a great day!

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-31.