Welcome back to my essay sequence on Site Reliability Engineering. If you haven't read the previous essays in this series, please do, because we'll continue to use vocabulary and concepts defined in them.
Previously, we discussed availability and how to catch hardware failures at their points of redundancy using defensive algorithms.
This time we expand on availability by discussing Cascading Incidents, where one incident triggers another, or perhaps several more, while the first is still ongoing.
Imagine this series of events:

1. A power failure takes down a handful of your production servers.
2. The surviving servers absorb the displaced load, and their memory usage climbs past what they can handle.
3. One by one, the over-utilized servers panic and reboot.
4. Each server that comes back online immediately inherits more load than it can hold, and panics again within minutes.
5. The whole fleet is now stuck in a reboot loop, and the site is down.
This must sound like a nightmare. It is one, because this is exactly how major site outages happen.
Notice that the original trigger, the power failure, is no longer relevant to the sustained failure or to how you remediate it. Worse, fixing one server does not fix the incident. Worse still, the servers crash again shortly after coming back online, so it's very hard to even log in and investigate.
Things get chaotic when multiple operators get involved. It's common for one operator to start flailing, running whatever commands they can think of in the hope of resuscitating the system. Perhaps they try rebooting all the machines at once to bring them back online at the same time. Unfortunately, this contributes to the problem: their actions make it harder for the rest of the team to investigate and understand what is happening.
A team that cannot identify the driving dynamic behind the incident will not be able to resolve it. In this case, the driving dynamic is memory over-utilization. Once the team understands that, a clear solution emerges: here, one option is to provision more machines, increasing the total memory pool so the panics stop.
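To make that driving dynamic concrete, here is a minimal sketch in Python, with made-up numbers rather than figures from any real incident. It models a fixed memory working set spread evenly across a pool of servers: whenever the per-server share exceeds a server's capacity, that server panics and the survivors inherit its load. With too few machines the pool spirals into a total outage; with enough machines the same working set is stable.

```python
# A minimal sketch of the driving dynamic behind this cascade, using
# illustrative numbers rather than data from a real system: a fixed memory
# working set is spread evenly across a pool of servers, and any server whose
# share exceeds its capacity panics, pushing its load onto the survivors.

def simulate_cascade(total_working_set_gb, per_server_capacity_gb, num_servers):
    healthy = num_servers
    while healthy > 0:
        per_server_load_gb = total_working_set_gb / healthy
        print(f"{healthy} healthy servers, {per_server_load_gb:.1f} GB each "
              f"(capacity {per_server_capacity_gb} GB)")
        if per_server_load_gb <= per_server_capacity_gb:
            print("stable: every server fits within its memory capacity\n")
            return
        # The most overloaded server panics; the rest inherit its share.
        healthy -= 1
    print("total outage: every server has panicked\n")


# A power failure shrinks an (assumed) pool of 10 servers to 8, pushing each
# survivor past its 32 GB capacity, and the cascade runs to a total outage.
simulate_cascade(total_working_set_gb=280, per_server_capacity_gb=32, num_servers=8)

# The remediation from the essay: provision enough machines that per-server
# load stays under capacity even after losing a few of them.
simulate_cascade(total_working_set_gb=280, per_server_capacity_gb=32, num_servers=12)
```

Running the sketch shows the first pool collapsing one server at a time while the second stays comfortably stable, which is the difference between a cascading incident and a non-event.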
Let's turn this example into takeaways.
Start preparing before any incident, in anticipation of cascades.
Cascading incidents are chaos, and they sound extreme, but I haven't heard of a start-up that has not encountered a cascading failure. Most start-ups aren't prepared the first time, and the outage ends up being a multi-hour ordeal. It's particularly common for this to happen during a major launch, where the surge of new traffic pushes production systems past tipping points they have never been tested against. It's so common that it should simply be the baseline expectation for new systems. Unfortunately, that's also the worst time for it to happen. The pain of this disaster is usually the moment when it dawns on companies to take reliability seriously.
That's a good place to stop for this essay. In the next essay, we will move on to durability, which involves risks that are low-probability but potentially business-ending. Have a great day!
That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-05-31.