Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.
Previously, we discussed how horrifying durability incidents can be and how to protect against them using data redundancy.
This time, I'll suggest that data recovery methods really need to be tested periodically, or else they won't work when you need them the most.
Let's say that you are concerned about a durability incident, so you begin taking periodic database backups.
It's great to have the backups, but how do you know that you can properly use a backup to complete a recovery? Unless you are periodically testing the recovery, you should not have confidence that data recovery would succeed.
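Data recovery logic tends to have quite a lot of edge cases, such as a backup that turns out to be incomplete, corrupted, or no longer readable by the current version of your software.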
During a data loss incident, it's a sad feeling to realize that your recovery workflow doesn't work. When the incident happens, you feel confident because you have redundancy, and then you feel crushed when you realize you actually don't have redundancy.
It's also quite embarrassing. So periodically test your recovery workflows! These workflows do not need to be fully automated; they just need to be tested regularly.
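To make "test the recovery" concrete, here is a minimal sketch of a restore drill. It assumes SQLite purely as a stand-in for whatever database you run, and the names `take_backup`, `restore_drill`, and `expected_rows` are hypothetical, invented for this illustration. The idea is to restore the backup into a scratch location, never over the live data, and then check that what came back is usable.

```python
import os
import sqlite3
import tempfile


def take_backup(live_path: str, backup_path: str) -> None:
    """Copy the live database into a backup file via SQLite's online backup API."""
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def restore_drill(backup_path: str, expected_rows: dict[str, int]) -> list[str]:
    """Restore a backup into a scratch database and return any problems found.

    expected_rows (an assumption of this sketch) maps each table name to the
    minimum number of rows we expect to see after a successful restore.
    """
    problems: list[str] = []
    with tempfile.TemporaryDirectory() as scratch_dir:
        restored_path = os.path.join(scratch_dir, "restored.db")
        backup = sqlite3.connect(backup_path)
        restored = sqlite3.connect(restored_path)
        try:
            # Step 1: perform the actual restore into the scratch database.
            backup.backup(restored)

            # Step 2: ask SQLite whether the restored file is structurally sound.
            status = restored.execute("PRAGMA integrity_check").fetchone()[0]
            if status != "ok":
                problems.append(f"integrity_check reported: {status}")

            # Step 3: confirm the tables we care about exist and hold data.
            for table, minimum in expected_rows.items():
                try:
                    count = restored.execute(
                        f'SELECT COUNT(*) FROM "{table}"'
                    ).fetchone()[0]
                except sqlite3.Error as exc:
                    problems.append(f"could not read table {table}: {exc}")
                    continue
                if count < minimum:
                    problems.append(
                        f"table {table} has {count} rows, expected at least {minimum}"
                    )
        except sqlite3.DatabaseError as exc:
            problems.append(f"restore failed: {exc}")
        finally:
            restored.close()
            backup.close()
    return problems
```

An empty list means the drill passed. Running something like this on a schedule, or even by hand from a checklist, is what turns "we have backups" into "we have recovered from a backup recently". A real drill would also check things this sketch ignores, such as how long the restore takes and how stale the newest backup is.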
That's a good place to stop for this essay. The tour through the reliability dimensions is complete. In the final essay, we will tie it together again and discuss staffing. Have a great day!
If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.