How my team do root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems for example from Etsy and the Beyond Blame book.  Hope you find it useful.

RCA Approach

 

This page describes a 7 step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice it is often a cocktail of contributing causes as well as negative (and sometimes positive) outcomes
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions.  Not just to solve a mystery or worse find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. they consciously carried out actions that they knew would cause this or something of similar negative consequences or they clearly understood the risks but cared so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions.  If it is “NO”, we can now proceed with a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answers are “YES”.  This is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could do to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could do to validate them and prioritise the discovery.
  3. If anyone feels uncomfortable during the process due to:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability have their voice heard they must raise it immediately.
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion attempt to reach a consensus over a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and identify and underline every aspect of the statement that someone could ask “Why” about.  Try to take an outsider view, even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into enough more specific “Why” questions so that your answers can be correlated directly.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer more than 5 “Why”s per branch possibly you are taking too large steps.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement where if necessary we should have implemented measures to protect ourselves from the third party.

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and make hypothesis’ about actions that could be taken to prevent future re-occurrences. Like all good hypothesis’ these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria and don’t close the actions unless they satisfy the tests that they have been effective.

 

Jenkins in the Enterprise

Several months ago I attended a conference about Continuous Delivery. The highlight of the conference was a talk from Kohsuke Kawaguchi the original creator of Jenkins. I enjoyed it so much, I decided to recapture it in this blog.

To the uninitiated, Jenkins is a Continuous Integration (CI) engine. This means basically it is a job scheduling tool designed to orchestrate automation related to all aspects software delivery (e.g. compilation, deployment, testing).

To make this blog clear, two pieces of Jenkins terminology are worth defining upfront:

Job – everything that you trigger to do some work in Jenkins is called a Job. A Job may have multiple pre-, during, and post- steps where a step could be executing a shell script or invoking an external tool.

Build – an execution of a Job (irrespective of whether that Job actually builds code or does something else like test or deploy code). There are numerous ways to trigger a Job to create a new Build, but the most common is by poling your version control repository (e.g. GIT) and automatically triggering when new changes are detected.

Jenkins is to some extent the successor of an earlier tool called Cruise Control that did the same thing, but was much more fiddly to configure. Jenkins is very easy to install (seriously try it), configure, and also very easy to extend with custom plugins. The result is that Jenkins is the most installed CI engine in the world with around 64,000 tracked installations. It also has around 700 plugins available for it which do everything from integrate with version control systems, to executing automated testing, to posting the status of your build to IRC or Twitter.

I’ve used Jenkins on and off since 2009 and when I come back to it, am always impressed at how far it has developed. As practices of continuous delivery have evolved, predominately through new plugins, Jenkins has also kept up (if not enabled experimentation and innovation). Hence the prospect of hearing the original creator talking about the latest set of must have plugins, was perhaps more exciting to me that I should really let on!

Kohsuke Kawaguchi’s lecture about using Jenkins for Continuous Delivery

After a punchy introduction to Jenkins, Kohsuke spent his whole lecture taking us through a list of the plugins that he considers most useful to implementing Continuous Delivery. Rather than sticking to the order that he presented them, I’m going to describe them in my own categories: Staples that I use; Alternative Solutions to Plugins I Use; New To Me.

NB. it is incredibly easy to find the online documentation for these plugins, so I’m not going to go crazy and hyperlink each one, instead please just Google the plugin name and the word Jenkins.

Staples that I use

First up was the Parameterised Builds Plugin. For me this is such a staple that I didn’t even realise it is a plugin. This allows you to alter the behavior of a Job by supplying different vales to input parameters. Kohsuke liked this to passing arguments to a function in code. The alternative to using this is to have lots of similar Job definitions, all of them hard-coded for their specific purpose (bad).

Parameterised Trigger was next. This allows you to chain a sequence Jobs in your first steps towards creating a delivery pipeline. With this plugin, the upstream Job can pass information to the downstream Job. What gets passed is usually a subset of its own parameters. If you want to pass information that is derived inside the upstream Job, you’ll need some Groovy magic… get in touch if you want help.

Arguably this plugin is also the first step towards building what might be a complex workflow of Jenkins Jobs, i.e. where steps can be executed in parallel and follow logical routing, and support goodness like restarting/skipping failed steps.

Kohsuke described a pattern of using this plugin to implement a chain of Jobs where the first Job triggers other Jobs, but does not complete until all triggered Jobs have completed. This was a new idea to me and probably something worth experimenting with.

The Build Pipeline view plugin was next. This is for me the most significant UI development that Jenkins has ever seen. It allows you to visualise a delivery pipeline as a just that, if you’ve not seen it before, click here and scroll down to the screenshot. Interestingly, the plugin hadn’t had an new version published for nearly a year and this was asked about during the Q&A (EDIT: it is now under active development again!). Apparently as can happen with Jenkins plugins, the original authors developed it to do everything they needed and moved on. A year later 3000 more people have downloaded it and thought of their own functionality requests. It then takes a while for one of those people to decide they are going to enhance it, learn how to and submit back. Two key features for me are:

  1. The ability to manually trigger parameterised jobs (I’ve got a Groovy solution for this, get in touch if you need it) (EDIT: now included!)
  2. The ability to define end points of the pipeline so that you can create integrated pipelines for multiple code bases.

The Join plugin allows you to trigger a Job upon the completion of more than one predecessor. This is essential if you want to implement a workflow where (for example) Job C waits for both A and B to complete.

This is good plugin, but I do have a word of caution that you need to patch your Build Pipeline plugin if you want it to display this pattern correctly (ask me if you need instructions).

The Dependency Graph plugin was mentioned as not only a good way of automatically visualising the dependencies between you Jobs, but also allowing you to create basic triggering using the JavaScript UI. This one I’ve used locally, but not tried on a project yet. It seems good, I’m just slightly nervous that it may not be compatible with all triggers. However, on reflection, using it in read-only mode would still be useful and should be low risk.

The Job Config History plugin got a mention. If you use Jenkins and don’t use this, my advice is GET IT NOW! It is extremely useful for tracking Job configuration changes. It highlights Builds that were the first to include a Job configuration change. It allows you do diff old and new versions of a Job configuration in a meaningful way directly in the Jenkins UI. It tells you who made changes and when. AND it lets you roll back to old versions of Jobs (particularly useful when you want to regress a change – perhaps to rule out your Job configurations changes being accountable for a bad change causing a code build failure).

Alternative Solutions to Plugins I Use

Jenkow plugin, the Build Flow plugin and the Job DSL plugin were all recommended as alternative methods of turning individual Jobs into a pipeline or workflow.

Jenkow stood out in my mind for the fact that it stores the configuration in a GIT repository which is an interesting approach. I know I’m contradicting my appreciation of the Job Config History plugin here, but I’m basically just interested to see how well this works. In addition, Jenkow supports defining the workflows in BPMN which I guess is great if you speak it and even if not, good that it opens up use of many free BPMN authoring tools. All of these seem to have been created to support more advanced workflows and I think it is encouraging that people have felt the need to do this.

The only doubt in my mind is how compatible some of these will be with the Build Pipeline plugin which for me is easily the best UI in Jenkins.

New Ideas To Me

The Promoted Builds plugin allows you to manually or automatically assign different promotions to indicate the level of quality of a particular build. This will be a familiar concept to anyone who has used ClearCase UCM where you can update the promotion levels of a label. I think in the context of Jenkins this is an excellent idea and I plan to explore whether it can be used to track sign-off of manual testing activities.

Fingerprinting (storing a database of checksums for build artefacts) was something that I knew Jenkins could do, but have never looked to exploit. The idea is that you can track artefact versions used in different Jobs. This page gives a good intro

The Activiti plugin was also a big eye opener to me. It seems to be an open source business process management (BPM) engine that supports manual tasks and has the key mission statement of being easy to use (like Jenkins). The reason this is of interest to me is that I think it’s support for manual processes could be a good mechanism for gluing Jenkins and continuous delivery into large existing enterprises rather than having some tasks hidden in Jenkins and some hidden elsewhere. I’m also interested in whether this tool could support the formal ITIL-eque release processes (for example CAB approvals) which are still highly unlikely to disappear in a cloud of DevOps smoke.