Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.
Previously, we discussed availability and how to catch hardware failures at their points of redundancy using defensive algorithms.
This time we expand on availability by discussing Cascading Incidents, where one incident triggers another, or perhaps several more, while the first is still ongoing.
Imagine this series of events:

1. A power failure takes down a handful of your production servers.
2. The surviving servers absorb the displaced load, and their memory usage climbs past what they can handle.
3. One by one, the over-utilized servers panic and reboot.
4. Each server that comes back online immediately inherits more load than it can hold, and panics again within minutes.
5. The whole fleet is now stuck in a reboot loop, and the site is down.
This must sound like a nightmare. It is one, because this is exactly how major site outages happen.
Notice that the original trigger, the power failure, is no longer relevant to the sustained failure or to how you remediate it. Worse, fixing one server does not fix the incident. Worse still, the servers crash again shortly after coming back online, so it's very hard to even log in and investigate.
Things get chaotic when multiple operators get involved. It's common for one operator to start flailing, running whatever commands they can think of in the hope of resuscitating the system. Perhaps they try rebooting all the machines at once to bring them back online at the same time. Unfortunately, this contributes to the problem: their actions make it harder for the rest of the team to investigate and understand what is happening.
A team that cannot identify the driving dynamic behind the incident will not be able to resolve it. In this case, the driving dynamic is memory over-utilization. Once the team understands that, a clear solution emerges: here, one option is to provision more machines, increasing the total memory pool so the panics stop.
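To make that driving dynamic concrete, here is a minimal sketch in Python, with made-up numbers rather than figures from any real incident. It models a fixed memory working set spread evenly across a pool of servers: whenever the per-server share exceeds a server's capacity, that server panics and the survivors inherit its load. With too few machines the pool spirals into a total outage; with enough machines the same working set is stable.

```python
# A minimal sketch of the driving dynamic behind this cascade, using
# illustrative numbers rather than data from a real system: a fixed memory
# working set is spread evenly across a pool of servers, and any server whose
# share exceeds its capacity panics, pushing its load onto the survivors.

def simulate_cascade(total_working_set_gb, per_server_capacity_gb, num_servers):
    healthy = num_servers
    while healthy > 0:
        per_server_load_gb = total_working_set_gb / healthy
        print(f"{healthy} healthy servers, {per_server_load_gb:.1f} GB each "
              f"(capacity {per_server_capacity_gb} GB)")
        if per_server_load_gb <= per_server_capacity_gb:
            print("stable: every server fits within its memory capacity\n")
            return
        # The most overloaded server panics; the rest inherit its share.
        healthy -= 1
    print("total outage: every server has panicked\n")


# A power failure shrinks an (assumed) pool of 10 servers to 8, pushing each
# survivor past its 32 GB capacity, and the cascade runs to a total outage.
simulate_cascade(total_working_set_gb=280, per_server_capacity_gb=32, num_servers=8)

# The remediation from the essay: provision enough machines that per-server
# load stays under capacity even after losing a few of them.
simulate_cascade(total_working_set_gb=280, per_server_capacity_gb=32, num_servers=12)
```

Running the sketch shows the first pool collapsing one server at a time while the second stays comfortably stable, which is the difference between a cascading incident and a non-event.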
Let's turn this example into takeaways.
Start preparing before any incident, in anticipation of cascades.
Cascading incidents are chaos, and they sound extreme, but I haven't heard of a start-up that has not encountered a cascading failure. Most start-ups aren't prepared the first time, and the outage ends up being a multi-hour ordeal. It's particularly common for this to happen during a major launch, where the surge of new traffic pushes production systems past tipping points they have never been tested against. It's so common that it should simply be the baseline expectation for new systems. Unfortunately, that's also the worst time for it to happen. The pain of this disaster is usually the moment when it dawns on companies to take reliability seriously.
That's a good place to stop for this essay. In the next essay, we will move on to durability, which involves risks that are low-probability but potentially business-ending. Have a great day!
That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.