Overreacting to Incidents

You don't have to fix everything

Imagine this conversation in an outage review meeting:

"Now that we've witnessed a full site outage, we should impose mandatory code review on all configuration changes!"

"That's over-reacting! Needing to wait for review for even small changes would slow the team's development a lot".

"But it's an inescapable fact that this outage happened, and so we need to prevent it from happening again."

The discussion stalls, no conclusion in sight.

The problem is that this team has not agreed on the value of preventing similar outages, and thus they are unable to quantify the amount of cost justifiable for their solutions.

Before diving into the solutions, somebody needs to ask, "How important is it that we prevent this from happening again, and what are we willing to bear for it?"

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-05-17.