How my team do root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems for example from Etsy and the Beyond Blame book.  Hope you find it useful.

RCA Approach

 

This page describes a 7 step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice it is often a cocktail of contributing causes as well as negative (and sometimes positive) outcomes
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions.  Not just to solve a mystery or worse find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. they consciously carried out actions that they knew would cause this or something of similar negative consequences or they clearly understood the risks but cared so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions.  If it is “NO”, we can now proceed with a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answers are “YES”.  This is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could do to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could do to validate them and prioritise the discovery.
  3. If anyone feels uncomfortable during the process due to:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability have their voice heard they must raise it immediately.
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion attempt to reach a consensus over a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and identify and underline every aspect of the statement that someone could ask “Why” about.  Try to take an outsider view, even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into enough more specific “Why” questions so that your answers can be correlated directly.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer more than 5 “Why”s per branch possibly you are taking too large steps.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement where if necessary we should have implemented measures to protect ourselves from the third party.

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and make hypothesis’ about actions that could be taken to prevent future re-occurrences. Like all good hypothesis’ these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria and don’t close the actions unless they satisfy the tests that they have been effective.

 

Advertisements

Security, it’s time to dance with the DevOps

Security are the next people in mainstream IT organisations (horses if you prefer) who need to start writing lots of code.  Many other types of people have already joined the coding party:

  • Developers started it(!),
  • Testers joined (automated testing done properly i.e, not recording mouse clicks),
  • Infrastructure and Networking joined (IaaS and SDN),
  • Middleware teams joined (e.g. Chef, Docker),
  • …and the latter 3 even joined together to build platform applications (PaaS / PaaA).

Only Security teams are still gingerly sipping a drink on the edge of the dance floor and watching the others.  Or perhaps waiting until the end of each song and shouting out complaints about anything making them unhappy.

So dance (code) Security teams… you’ll be fine and the party needs you.

We know Deming wouldn’t be happy with practices such as 11th hour pen testing.

Inspection does not improve the quality, nor guarantee quality. Inspection is too late. The quality, good or bad, is already in the product. As Harold F. Dodge said, “You can not inspect quality into a product.”

There are lots of neat ways Security teams can write code:

  • Build-time checks on 3rd party dependencies e.g. the OWASP Dependency-Check Plugin or Black Duck (even legal departments may want to dance for that)
  • Unit tests
  • Static code analysis focused on security e.g. SonarQube
  • Write runtime security checks (heck even do BDD and use something like Cucumber to capture malicious use cases).
  • Penetration testing during a continuous delivery pipeline e.g. using OWASP ZAPP
  • Many more opportunities I’m sure – feel free to add them as comments to this blog.

I’m not suggesting this idea is new, for example this document was from 2006.  I just still don’t see evidence of enough of it.

So Security teams – if you haven’t already, it’s time to download an IDE!