Saturday, March 10, 2012

What I learned from John Allspaw and Eric Ries about root cause analysis

In his talk "Advanced Postmortem Fu and Human Error 101" at the 2011 Velocity conference, John Allspaw talked, among other things, about root cause analysis. One of his points was that there is no such thing as a root cause for any given incident in a complex system. It's more like a coincidence of several things that together make a failure happen. I liked his visualization using several slices of Swiss cheese, where the holes of the slices happen to align so that a straight line can run through all of them, as a symbol for something bad happening.

In hindsight, there often seems to be a single action that would have prevented the incident from happening. This one thing is what managers search for when they do a root cause analysis, hoping to prevent the incident from ever happening again. But if the incident was only possible due to the coincidence of many events, it makes no sense to search for a single root cause. Doing so leads to a false sense of security, because in a complex system there are too many ways an incident can happen.

I recently read "The Lean Startup" by Eric Ries. In one of the last chapters he suggests using the Five Whys method on incidents: if something unexpected happens, ask why it happened, then apply the next why to the answer of the previous question. At first I thought he was "only" looking for the root cause, which would not make much sense, as explained above. But asking these questions does more than find an underlying cause: it brings to light a chain of events leading to the incident. And Eric Ries recommends finding a countermeasure at every level of that chain. This makes sure that we are not just fixing the symptoms, but improving the immune defense of our system.
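To make the chain idea concrete, here is a small Python sketch of how one could write down such an analysis. The incident, the answers and the countermeasures are invented purely for illustration, and the names `Why` and `five_whys` are my own; none of this comes from the book or the talk.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str        # the "why" asked at this level
    answer: str          # what was found out
    countermeasure: str  # a fix proposed at this level of the chain


def five_whys(incident, findings):
    """Build the chain: each why is applied to the previous answer."""
    chain = []
    subject = incident
    for answer, countermeasure in findings:
        chain.append(Why(f"Why: {subject}?", answer, countermeasure))
        subject = answer  # the next why targets this answer
    return chain


# Hypothetical incident, made up for this example.
chain = five_whys(
    "the site went down",
    [
        ("a deploy shipped a broken config", "validate configs before deploying"),
        ("the config was edited by hand", "generate configs from templates"),
        ("the template tool was too slow to use", "speed up the template tool"),
        ("nobody owned the template tool", "assign an owner for the tool"),
        ("tooling work was never prioritized", "reserve time for tooling each sprint"),
    ],
)

for level, why in enumerate(chain, start=1):
    print(f"{level}. {why.question}")
    print(f"   Because {why.answer}.")
    print(f"   Countermeasure: {why.countermeasure}")
```

The point of the structure is that every level carries its own countermeasure, so the result is a list of five improvements and not one "root cause" fix at the bottom of the chain.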

I like that idea. It takes much more work than only preventing the "root cause", but it gives a much better understanding of the system and is good training for everybody.
