A debate rages in a software company's incident postmortem meeting --
"Why did the website stop working?"
"Well, that was because the database went down"
"And that was because somebody accidentally ran unit tests against the production database"
"And that was because it's not intuitive to realize that you are running commands in the wrong environment"
"We've found the root cause! To fix this, we must make it easier for our developers to realize which environment they are in so they don't run the wrong command!"
"No, the root cause is that the unit test command is usable in the production environment. How about we just disable the command when it's not the development environment?"
Hold it!
There is no single 'root' cause. There are only underlying causes. The chain of causation can be walked infinitely backwards.
It gets even more complex than that -- some events require the conjunction of two simultaneous prior events. Here, the failsafe needed to be missing AND the developer needed to be unaware of the environment they were working in. Remove either of these conditions and the incident would not have occurred.
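To make the conjunction concrete, here is a minimal sketch in Python. It is not the company's actual tooling: the `APP_ENV` variable, the `"development"` value, and every function name are assumptions made up for illustration. It shows the incident as an AND of two conditions, plus one possible version of the failsafe proposed in the meeting.

```python
import os

# Hypothetical: APP_ENV and "development" are assumed names, not from any real setup.
ALLOWED_TEST_ENVS = {"development"}


def incident_occurs(failsafe_missing, developer_unaware):
    """The incident needs BOTH conditions; removing either one prevents it."""
    return failsafe_missing and developer_unaware


def current_environment(environ=None):
    """Read the environment name; an unset variable is treated as unknown."""
    env = os.environ if environ is None else environ
    return env.get("APP_ENV", "unknown")


def run_unit_tests(environ=None):
    """Failsafe sketch: refuse to run unless the environment is explicitly allowed."""
    environment = current_environment(environ)
    if environment not in ALLOWED_TEST_ENVS:
        raise RuntimeError(
            f"Refusing to run unit tests in {environment!r} environment"
        )
    return f"running unit tests in {environment}"
```

Note that the guard fails closed: an unset or unrecognized environment is refused too. That is only one of the two fixes on the table. Making the environment obvious to the developer attacks the other half of the AND, and nothing stops you from doing both.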
The practical problem with focusing on 'root causes' is that it leads discussions like these towards one fix, and not towards multiple fixes of high leverage and low implementation cost. It doesn't matter which event seems ridiculous or unreasonable. What matters is moving forward.
Linguistically, when people say 'root cause' they mean one or more of the following --
These are all reasonable ways to inspire improvement ideas, but they are not the appropriate framework for choosing the improvements. The question to ask, then, is what this situation teaches us about improvements we can make...
That's all for this essay. If you have a question or an interesting thought to bounce around, email me back at david@davidmah.com. I'd enjoy the chat.
This essay was published 2020-04-26.