How my team do root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems for example from Etsy and the Beyond Blame book.  Hope you find it useful.

RCA Approach

 

This page describes a 7 step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice it is often a cocktail of contributing causes as well as negative (and sometimes positive) outcomes
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions.  Not just to solve a mystery or worse find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. they consciously carried out actions that they knew would cause this or something of similar negative consequences or they clearly understood the risks but cared so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions.  If it is “NO”, we can now proceed with a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answers are “YES”.  This is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could do to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could do to validate them and prioritise the discovery.
  3. If anyone feels uncomfortable during the process due to:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability have their voice heard they must raise it immediately.
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion attempt to reach a consensus over a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and identify and underline every aspect of the statement that someone could ask “Why” about.  Try to take an outsider view, even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into enough more specific “Why” questions so that your answers can be correlated directly.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer more than 5 “Why”s per branch possibly you are taking too large steps.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement where if necessary we should have implemented measures to protect ourselves from the third party.

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and make hypothesis’ about actions that could be taken to prevent future re-occurrences. Like all good hypothesis’ these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria and don’t close the actions unless they satisfy the tests that they have been effective.

 

Using ADOP and Docker to Learn Ansible

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.  Through the concept of cartridges (plugins) ADOP also makes it very easy to re-use automation.

In this blog I will describe an ADOP Cartridge that I created as an easy way to experiment with Ansible.  Of course there are many other ways of experimenting with Ansible such as using Vagrant.  I chose to create an ADOP cartridge because ADOP is so easy to provision and predictable.  If you have an ADOP instance running you will be able to experience Ansible doing various interesting things in under 15 minutes.

To try this for yourself:

  1. Spin up and ADOP instance
  2. Load the Ansible 101 Cartridge (instructions)
  3. Run the jobs one-by-one and in each case read the console output.
  4. Re-run the jobs with different input parameters.

To anyone only loosely familiar with ADOP, Docker and Ansible, I recognise that this blog could be hard to follow so here is a quick diagram of what is going on.

docker-ansible

The Jenkins Jobs in the Cartridge

The jobs do the following things:

As the name suggests, this job just demonstrates how to install Ansible on Centos.  It installs Ansible in a Docker container in order to keep things simple and easy to clean up.  Having build a Docker image with Ansible installed, it tests the image just by running inside the container.

$ ansible --version

2_Run_Example_Adhoc_Commands

This job is a lot more interesting than the previous.  As the name suggests, the job is designed to run some adhoc Ansible commands (which is one of the first things you’ll do when learning Ansible).

Since the purpose of Ansible is infrastructure automation we first need to set up and environment to run commands against.  My idea was to set up an environment of Docker containers pretending to be servers.  In real life I don’t think we would ever want Ansible configuring running Docker containers (we normally want Docker containers to be immutable and certainly don’t want them to have ssh access enabled).  However I felt it a quick way to get started and create something repeatable and disposable.

The environment created resembles the diagram above.  As you can see we create two Docker containers (acting as servers) calling themselves web-node and one calling it’s self db-node.  The images already contain a public key (the same one vagrant uses actually) so that they can be ssh’d to (once again not good practice with Docker containers, but needed so that we can treat them like servers and use Ansible).  We then use an image which we refer to as the Ansible Control Container.  We create this image by installing Ansible installation and adding a Ansible hosts file that tells Ansible how to connect to the db and web “nodes” using the same key mentioned above.

With the environment in place the job runs the following ad hoc Ansible commands:

  1. ping all web nodes using the Ansible ping module: ansible web -m ping
  2. gather facts about the db node using the Ansible setup module: ansible db -m setup
  3. add a user to all web servers using the Ansible user module:  ansible web -b -m user -a “name=johnd comment=”John Doe” uid=1040″

By running the job and reading the console output you can see Ansible in action and then update the job to learn more.

3_Run_Your_Adhoc_Command

This job is identical to the job above in terms of setting up an environment to run Ansible.  However instead of having the hard-coded ad hoc Ansible commands listed above, it allows you to enter your own commands when running the job.  By default it pings all nodes:

ansible all -m ping

4_Run_A_Playbook

This job is identical to the job above in terms of setting up an environment to run Ansible.  However instead of passing in an ad hoc Ansible command, it lets you pass in an Ansible playbook to also run against the nodes.  By default the playbook that gets run installs Apache on the web nodes and PostgreSQL on the db node.  Of course you can change this to run any playbook you like so long as it is set to run on a host expression that matches: web-node-1, web-node-2, and/or db-node (or “all”).

How the jobs 2-4 work

To understand exactly how jobs 2-4 work, the code is reasonably well commented and should be fairly readable.  However, at a high-level the following steps are run:

  1. Create the Ansible inventory (hosts) file that our Ansible Control Container will need so that it can connect (ssh) to our db and web “nodes” to control them.
  2. Build the Docker image for our Ansible Control Container (install Ansible like the first Jenkins job, and then add the inventory file)
  3. Create a Docker network for our pretend server containers and our Ansible Control container to all run on.
  4. Create a docker-compose file for our pretend servers environment
  5. Use docker-compose to create our pretend servers environment
  6. Run the Ansible Control Container mounting in the Jenkins workspace if we want to run a local playbook file or if not just running the ad hoc Ansible command.

Conclusion

I hope this has been a useful read and has clarified a few things about Ansible, ADOP and Docker.  If you find this useful please star the GitHub repo and or share a pull request!

Bonus: here is an ADOP Platform Extension for Ansible Tower.

ADOP with Pivotal Cloud Foundry

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.

In this blog I will describe integrating ADOP and the Cloud Foundry public PaaS from Pivotal.  Whilst it is of course technically possible to run all of the tools found in ADOP on Cloud Foundry, that wasn’t our intention.  Instead we wanted to combine the Continuous Delivery pipeline capabilities of ADOP with the industrial grade cloud first environments that Cloud Foundry offers.

Many ADOP cartridges for example the Java Petclinic one contain two Continuous Delivery pipelines:

  • The first to build and test the infrastructure code and build the Platform Application
  • The second to build and test the application code and deploy it to an environment built on the Platform Application.

The beauty of using a Public PaaS like Pivotal Cloud Foundry is that your platforms and environments are taken care of leaving you much more time to focus on the application code.  However you do of course still need to create an account and provision your environments.

  1. Register here
  2. Click Pivotal Web Services
  3. Create a free tier account
  4. Create and organisation
  5. Create one or more spaces

With this in place you are ready to:

  1. Spin up and ADOP instance
  2. Store your Cloud Foundry credentials in Jenkins’ Secure Store
  3. Load the Cloud Foundry Cartridge (instructions)
  4. Trigger the Continuous Delivery pipeline.

Having done all of this, the pipeline now does the following:

  1. Builds the code (which happens to be the JPetStore
  2. Runs the Unit Test and performs Static Code Analysis using SonarQube
  3. Deploys the code to an environment also known in Cloud Foundry as a Space
  4. Performs functional testing using Selenium and some security testing using OWASP ZAPP.
  5. Performs some performance testing using Gatling.
  6. Kills the running application in environment and waits to verify that Cloud Foundry automatically restores it.
  7. Deploys the application to a multi node Cloud Foundry environment.
  8. Kills one of the nodes in Cloud Foundry and validates that Cloud Foundry automatically avoids sending traffic to the killed node.

The beauty of ADOP is that all of this great Continuous Delivery automation is fully portable and can be loaded time and time again into any ADOP instance running on any cloud.

There is plenty more we could have done with the cartridge to really put the PaaS through its paces such as generating load and watching auto-scaling in action.  Everything is on Github, so pull requests will be warmly welcomed!  If you’ve tried to follow along but got stuck at all, please comment on this blog.

Abstraction is not Obsoletion – Abstraction is Survival

Successfully delivering Enterprise IT is a complicated, probably even complex problem.  What’s surprising, is that as an industry, many of us are still comfortable accepting so much of the problem as our own to manage.

Let’s consider an albeit very simplified and arguably imprecise view of The “full stack”:

  • Physical electrical characteristics of materials (e.g. copper / p-type silicon, …)
  • Electronic components (resistor, capacitor, transistor)
  • Integrated circuits
  • CPUs and storage
  • Hardware devices
  • Operating Systems
  • Assembly Language
  • Modern Software Languages
  • Middleware Software
  • Business Software Systems
  • Business Logic

When you examine this view, hopefully (irrespective of what you think about what’s included or missing and the order) it is clear that when we do “IT” we are already extremely comfortable being abstracted from detail. We are already fully ready to use things which we do not and may never understand. When we build an eCommerce Platform, an ERP, or CRM system, little thought it given to Electronic components for example.

My challenge to the industry as a whole is to recognise more openly the immense benefit of abstraction for which we are already entirely dependent and to embrace it even more urgently!

Here is my thinking:

  • Electrons are hard – we take them for granted
  • Integrated circuits are hard – so we take them for granted
  • Hardware devices (servers for example) are hard – so why are so many enterprises still buying and managing them?
  • The software that it takes to make servers useful for hosting an application is hard – so why are we still doing this by default?

For solutions that still involve writing code, the most extreme example of abstraction I’ve experienced so far is the Lambda service from AWS.  Some seem to have started calling such things ServerLess computing.

With Lambda you write your software functions and upload them ready for AWS to run for you. Then you configure the triggering event that would cause your function to run. Then you sit back and pay for the privilege whilst enjoying the benefits. Obviously if the benefits outweigh the cost for the service you are making money. (Or perhaps in the world of venture capital, if the benefits are generating lots of revenue or even just active users growth, for now you don’t care…)

Let’s take a mobile example. Anyone with enough time and dedication can sit at home on a laptop and start writing mobile applications. If they write it as a purely standalone, offline application, and charge a small fee for it, theoretically they can make enough money to retire-on without even knowing how to spell server.  But in practice most applications (even if they just rely on in app-adverts) require network enabled services. But for this our app developer still doesn’t need to spell server, they just need to use the API of the online add company e.g. Adwords and their app will start generating advertising revenue. Next perhaps the application relies on persisting data off the device or notifications to be pushed to it. The developer still only needs to use another API to do this, for example Parse can provide that to you all as a programming service.  You just use the software development kit and are completely abstracted from servers.

So why are so many enterprises still exposing themselves to so much of the “full stack” above?  I wonder how much inertia there was to integrated circuits in the 1950s and how many people argued against abstraction from transistors…

To survive is to embrace Abstraction!

 


[1] Abstraction in a general computer science sense not a mathematical one (as used by Joel Spolsky in his excellent Law of Leaky Abstractions blog.)

Join the DevOps Community Today!

As I’ve said in the past, if your organisation does not yet consider itself to be “doing DevOps” you should change that today.

If I was pushed to say the one thing I love most about the DevOps movement, it would be the sense of community and sharing.

I’ve never experienced anything like it previously in our industry.  It seems like everyone involved is united by being passionate about collaborating in as many ways as possible to improve:

  • the world through software
  • the rate at which we can do that
  • the lives of those working our industry.

The barrier to entry to this community is extremely low, for example you can:

You could also consider attending the DevOps Enterprise Summit London (DOES).  It’s the third DOES event and the first ever in Europe and is highly likely to be one of the most important professional development things you do this year.  Organised by Gene Kim (co-author of The Phoenix Project) and IT Revolution, the conference is highly focused on bringing together anyone interested in DevOps and providing them as much support as humanly possible in two days.  This involves presentations from some of the most advanced IT organisations in the world (aka unicorns), as well as many from those in traditional enterprises who may be on a very similar journey to you.   Already confirmed are talks from:

  • Rosalind Radcliffe talking about doing DevOps with Mainframe systems
  • Ron Van Kemenade CIO of ING Bank
  • Jason Cox about doing DevOps transformation at Disney
  • Scott Potter Head of New Engineering at News UK
  • And many more.

My recommendation is to get as many of your organisation along to the event as possible.  They won’t be disappointed.

Early bird tickets are available until 11th May 2016.

(Full disclosure – I’m a volunteer on the DOES London committee.)

London Banner logo_770x330

Reducing Continuous Delivery Impedance – Part 5: Learned Helplessness

Nearly two years ago, I started this blog series to describe the main challenges I’d experienced trying to implement Continuous Delivery.  At the time, the last post in the series was about four challenges related to people.  Since then I’ve observed a fifth challenge and discovered it has been studied in psychology and has a name.

In this post I’ll attempt to describe how to recognise and tackle Learned Helplessness.  Please share your comments (especially if my Psychology-by-Wikipedia needs guidance).

Through various interactions with clients, at meetups, conferences and even with my own team, I’ve witnessed the following phenomena:

  • Something is done (or not done) on an engagement that makes Continuous Delivery difficult (for example the development team accepting SonarQube saying some seriously defamatory things about their unit test coverage but neglecting even to gradually address this).
  • When questioned:
    • many people already appreciate that this is very wrong.
    • hardly anyone can really explain or justify why this is happening.
    • hardly anyone seems worked up about a solution.

It gave me an impression that people had experienced good practice in the past, but having joined this particular engagement had somehow lost the inclination to do it.  It’s possible that for some people, in the past when things just worked, they didn’t question it, so never really appreciated the value of particular practices.  But I think most people are more analytical than that.  I started to realise that people probably had gone through an experience like this:

  • Joined the engagement, didn’t understand why certain things were / weren’t done, but opted to observe before speaking up.
  • Realised things actually weren’t magically working in some new logic- / experience- defying way.
  • Spoke up but didn’t really get listened to.
  • Spoke up again several times , but didn’t really ever get listened to.
  • Gave up and accepted things for the sorry way that they are.

I figured there must be a name for this, started googling and realised it is called Learned Helplessness, something that was first experimented in the 1960’s by some scientists we can probably assume weren’t dog lovers…

The experiments are best described here on Wikipedia but in extremely simplified form:

  1. some dogs were given no random electric shocks,
  2. some dogs were given shocks and also given a button to press to disable the shocks,
  3. some dogs received shocks at the same time as group 2 dogs but had no button.  Group 3 dogs were paired with Group 2 dogs and were shocked until their Group 2 pair happened to press the button (which was at a random time from the Group 3 dog’s perspective).

The learned helplessness of Group 3 was demonstrated in the second part of the experiments when dogs had the opportunity to cross over a small wall to avoid getting shocks.  Whereas groups 1 and 2 quickly learned how to avoid shocks, group 3 all failed to learn and sat their accepting their fate in pain.

http://wariscrime.com/new/wp-content/uploads/2015/02/seligman-2.jpg

The similarity of the above diagram to diagrams about DevOps like this made me smile!

Subsequent experiments demonstrated the ineffectiveness of threats or even rewards on motivating group 3 to change their location.  Only by physically teaching the group 3 dogs to move more than twice did they learn to overcome the helplessness.  Later experiments also proved the same phenomena in humans (without electricity).

So how do we overcome this?

Here are some things I’m experimenting with:

  • Try some introspection – ask yourself what you’ve learnt to accept, really look around for things that are stopping your project going faster – no matter how obvious, and start to ask why, perhaps at least 5 times.
  • Ask others around you ideally at all levels of experience less, the same and more than you what they think is preventing learning and improvement and consider asking “5 Whys” with them.
  • Pay close attention to new joiners to your team – they are the only ones not yet infected by Learned Helplessness.
  • Be sensitive with people.  No-one wants to be told they are “helpless” or hear your amateur psychobabble.  Tread carefully.
  • If you are looking to impart a change, don’t over estimate the impact of threatening or incentivising the people who need to change – they may already be too apathetic.  Instead expect to need to show them multiple times:
    • That the proposed change is possible.  You need to demonstrate it to them (for example if it relates to Continuous Delivery something like the DevOps Platform may help make things real).
    • That their opinions count and they have an important voice.

How is Learned Helplessness harming your organisation and to what extent are you suffering?

 

Running the DevOps Platform on Microsoft Azure

As per my last post about GCE sometimes knowing something is possible just isn’t good enough.  So here is how I spun up the DevOps Platform on the Microsoft Azure cloud.  Warning thanks to Docker Machine, this post is very similar to this earlier one.

1. I needed an Azure account.

2. I logged into my Azure account and didn’t click “view the new Portal”.

3. On the left hand menu, I scrolled down to the bottom (it didn’t look immediately to me like it will scroll so hover) and clicked settings.  Here I was able to see my subscription ID and copy it.

4. (Having previously installed Docker Toolbox, see here) I opened Git Bash (as an Administrator) and ran this command:

$ docker-machine create --driver azure --azure-size Standard_A3 --azure-subscription-id <the ID I just copied> markos01

I was prompted to open a url in my brower, enter a confirmation code, and then login with my Azure credentials.  Credit to Microsoft, this was easier than GCE for which I needed to install the gcloud commandline utility!

You will notice that this is fairly standard.  I picked an Standard_A3 machine type which is roughly equivalent to what we use for AWS and GCP.

5. I waited while a machine was created in Azure containing Docker

6. I cloned the ADOP Docker Compose repository from GitHub:

$ git clone https://github.com/Accenture/adop-docker-compose
$ cd adop-docker-compose

7. I ran the normal startup.sh command as follows:

$ ./startup.sh -m markos01 -c NA

And entered a user name (thanks to this recent enhancement), hey presto

...
SUCCESS, your new ADOP instance is ready!
Run these commands in your shell:
eval \"$(docker-machine env $MACHINE_NAME)\"
source env.config.sh
Navigate to http://52.160.97.159 in your browser to use your new DevOps Platform!

And just to prove it:

$ whois 52.160.97.159 | grep Org
Organization: Microsoft Corporation (MSFT)
OrgName: Microsoft Corporation
OrgId: MSFT

8. I had to go to All resources > markos01-firewall > Inbound security rules and added a rule to allow HTTP to my server on port 80.

9. I viewed my new ADOP on Azure hosted instance in (of course…) Chrome! 😉

More lovely stuff!