davidmah.com

There is no such thing as a 'Root Cause'

A debate rages in a software company's incident postmortem meeting --

"Why did the website stop working?"

"Well, that was because the database went down"

"And that was because somebody accidentally ran unit tests against the production database"

"And that was because it's not intuitive to realize that you are running commands in the wrong environment"

"We've found the root cause! To fix this, we must make it easier for our developers to realize which environment they are in so they don't run the wrong command!"

"No, the root cause is that the unit test command is usable in the production environment. How about we just disable the command when it's not the development environment?"

Hold it!

There is no single 'root' cause. There are only underlying causes. The chain of causation can be walked infinitely backwards.

The engineer who designed the production environment didn't make it failsafe.
The manager who hired that engineer could have hired a different engineer.
Instead of starting a software company, the CEO could have started a restaurant.

It gets even more complex than that -- some events require the conjunction of two prior conditions. The failsafe needed to be missing AND the developer needed to be unaware of the environment they were working in. Remove either one of these conditions and the incident would not have occurred.
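To make the AND concrete, here is a minimal Python sketch. The condition names are my own labels for the two conditions in this story, not anything from a real tool. It checks, one condition at a time, whether removing that condition alone would have prevented the incident.

```python
# Hypothetical labels for the two conditions in the story above.
CONDITIONS = {
    "failsafe_missing": True,
    "developer_unaware_of_environment": True,
}


def incident_occurs(conditions):
    """The incident only happens when every contributing condition holds."""
    return all(conditions.values())


def preventing_fixes(conditions):
    """Return each condition whose removal alone prevents the incident."""
    fixes = []
    for name in conditions:
        # Flip one condition off and leave the rest untouched.
        changed = dict(conditions, **{name: False})
        if not incident_occurs(changed):
            fixes.append(name)
    return fixes


print(preventing_fixes(CONDITIONS))
```

Both conditions come back as valid places to intervene, which is the point: neither one is more 'root' than the other.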

The practical problem with focusing on 'root causes' is that it leads discussions like these towards one fix, and not towards multiple fixes of high leverage and low implementation cost. It doesn't matter which event seems ridiculous or unreasonable. What matters is moving forward.

Linguistically, when people say 'root cause' they mean one or more of the following --

An underlying cause that comes off as the most unreasonable.
The bottom-most bullet point that they arrived at during the exercise.
An underlying cause that, if changed, would solve several other issues too.

These are all reasonable ways to inspire improvement ideas, but they are not the appropriate framework for choosing the improvements. The question to ask, then, is what this situation teaches us about improvements we can make...

that will protect us from a variety of risks
and which of these improvements come at or below a reasonable cost to us?
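One way to picture that question is the rough Python sketch below. Everything in it is an assumption for illustration: the `Improvement` type, the candidate fixes, the risks they cover, and the costs (in made-up engineer-days) are invented, not measured. It keeps every improvement under a cost ceiling and orders them by risks covered per unit of cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Improvement:
    name: str
    risks_covered: frozenset  # risks this improvement would protect against
    cost: float               # rough implementation cost; must be > 0


def affordable_by_leverage(improvements, max_cost):
    """Keep improvements within max_cost, highest risks-per-cost first."""
    affordable = [i for i in improvements if 0 < i.cost <= max_cost]
    return sorted(
        affordable,
        key=lambda i: len(i.risks_covered) / i.cost,
        reverse=True,
    )


# Illustrative candidates only; the risks and costs are made up.
CANDIDATES = [
    Improvement(
        "disable the unit test command outside development",
        frozenset({"tests run against production"}),
        cost=1,
    ),
    Improvement(
        "show the environment name in the shell prompt",
        frozenset({
            "tests run against production",
            "deploy to the wrong environment",
            "migration run in the wrong environment",
        }),
        cost=2,
    ),
    Improvement(
        "rebuild production with full network isolation",
        frozenset({
            "tests run against production",
            "deploy to the wrong environment",
            "migration run in the wrong environment",
        }),
        cost=30,
    ),
]

for improvement in affordable_by_leverage(CANDIDATES, max_cost=5):
    print(improvement.name)
```

Notice that the result is a list, not a single winner. With these made-up numbers the prompt change and the disabled command both make the cut, and the expensive rebuild is dropped.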

That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.

This essay was published 2020-04-26.