How my team does root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems for example from Etsy and the Beyond Blame book.  Hope you find it useful.

RCA Approach

 

This page describes a 7 step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice it is often a cocktail of contributing causes as well as negative (and sometimes positive) outcomes
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions.  Not just to solve a mystery or worse find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. did they consciously carry out actions that they knew would cause this (or something of similarly negative consequence), or did they clearly understand the risks but care so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions.  If it is “NO”, we can now proceed in a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something differently.

If either answer is “YES”, this is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could do to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need to go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could do to validate them and prioritise the discovery.
  3. If anyone feels uncomfortable during the process due to:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability to have their voice heard
  they must raise it immediately.
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion attempt to reach a consensus over a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and identify and underline every aspect of the statement that someone could ask “Why” about.  Try to take an outsider’s view: even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into enough more specific “Why” questions so that your answers can be correlated directly.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer more than 5 “Why”s per branch, you are possibly taking steps that are too large.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement where if necessary we should have implemented measures to protect ourselves from the third party.
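The tree-building heuristics above (at least 3 branches, at least 5 “Why”s deep) lend themselves to a quick sanity check. Below is a minimal Python sketch; the class and function names are my own illustration, not part of the process:

```python
class WhyNode:
    """A node in the analysis tree: a "Why" question and its direct answer."""
    def __init__(self, question, answer, children=None):
        self.question = question
        self.answer = answer
        self.children = children or []

def branch_depths(node, depth=0):
    """Return the depth of every leaf under this node."""
    if not node.children:
        return [depth]
    depths = []
    for child in node.children:
        depths.extend(branch_depths(child, depth + 1))
    return depths

def check_tree(root, min_branches=3, min_depth=5):
    """Apply the Step 6 heuristics: enough branches, enough depth per branch."""
    warnings = []
    if len(root.children) < min_branches:
        warnings.append("ask more fundamental 'Why' questions")
    if any(d < min_depth for d in branch_depths(root)):
        warnings.append("some branches may be taking too-large steps")
    return warnings
```

Leaves that fail the check are exactly the points where the facilitator should push for another “Why”.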

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and form hypotheses about actions that could be taken to prevent future re-occurrences. Like all good hypotheses, these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria and don’t close the actions unless they satisfy the tests that they have been effective.

 

ADOP with Pivotal Cloud Foundry

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.

In this blog I will describe integrating ADOP and the Cloud Foundry public PaaS from Pivotal.  Whilst it is of course technically possible to run all of the tools found in ADOP on Cloud Foundry, that wasn’t our intention.  Instead we wanted to combine the Continuous Delivery pipeline capabilities of ADOP with the industrial grade cloud first environments that Cloud Foundry offers.

Many ADOP cartridges, for example the Java Petclinic one, contain two Continuous Delivery pipelines:

  • The first to build and test the infrastructure code and build the Platform Application
  • The second to build and test the application code and deploy it to an environment built on the Platform Application.

The beauty of using a Public PaaS like Pivotal Cloud Foundry is that your platforms and environments are taken care of leaving you much more time to focus on the application code.  However you do of course still need to create an account and provision your environments.

  1. Register here
  2. Click Pivotal Web Services
  3. Create a free tier account
  4. Create an organisation
  5. Create one or more spaces

With this in place you are ready to:

  1. Spin up an ADOP instance
  2. Store your Cloud Foundry credentials in Jenkins’ Secure Store
  3. Load the Cloud Foundry Cartridge (instructions)
  4. Trigger the Continuous Delivery pipeline.
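The cartridge drives Cloud Foundry through the cf command-line client. As a hedged sketch (the real commands live in the cartridge’s Jenkins job definitions, and the API endpoint, org, space and app names here are placeholders), the typical sequence looks like this:

```python
def cf_setup_commands(api, org, space, app, manifest="manifest.yml"):
    """Build the typical cf CLI sequence: authenticate, target a Space, push the app."""
    return [
        ["cf", "login", "-a", api, "--sso"],       # or -u/-p with credentials from Jenkins' secure store
        ["cf", "target", "-o", org, "-s", space],  # select organisation and Space
        ["cf", "push", app, "-f", manifest],       # deploy the application
    ]

commands = cf_setup_commands(
    "https://api.run.pivotal.io", "my-org", "dev-space", "jpetstore"
)
```

Each list could be handed to `subprocess.run` in turn; in the pipeline the equivalent happens inside Jenkins shell steps.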

Having done all of this, the pipeline now does the following:

  1. Builds the code (which happens to be JPetStore)
  2. Runs the Unit Tests and performs Static Code Analysis using SonarQube
  3. Deploys the code to an environment, also known in Cloud Foundry as a Space
  4. Performs functional testing using Selenium and some security testing using OWASP ZAP.
  5. Performs some performance testing using Gatling.
  6. Kills the running application in the environment and waits to verify that Cloud Foundry automatically restores it.
  7. Deploys the application to a multi node Cloud Foundry environment.
  8. Kills one of the nodes in Cloud Foundry and validates that Cloud Foundry automatically avoids sending traffic to the killed node.
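In the cartridge these steps are Jenkins jobs chained into a pipeline, but the control flow is easy to sketch: run the stages in order and stop at the first failure. The stage names below are illustrative, not the cartridge’s real job names:

```python
def run_pipeline(stages):
    """Run stages in order; stop at the first failure, like a chained Jenkins pipeline."""
    completed = []
    for name, stage in stages:
        if not stage():
            return completed, name  # completed stages, plus the stage that failed
        completed.append(name)
    return completed, None

# Illustrative stages mirroring the list above; each returns True on success.
stages = [
    ("build", lambda: True),                 # build JPetStore
    ("unit-test-and-sonar", lambda: True),   # unit tests + SonarQube analysis
    ("deploy-to-space", lambda: True),       # push to a Cloud Foundry Space
    ("functional-and-zap", lambda: True),    # Selenium + OWASP ZAP
    ("performance", lambda: True),           # Gatling
    ("kill-and-verify-recovery", lambda: True),
    ("deploy-multi-node", lambda: True),
    ("kill-node-verify-routing", lambda: True),
]
```

A failing stage (for example a broken unit test) halts the run, so nothing downstream ever sees an unverified build.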

The beauty of ADOP is that all of this great Continuous Delivery automation is fully portable and can be loaded time and time again into any ADOP instance running on any cloud.

There is plenty more we could have done with the cartridge to really put the PaaS through its paces such as generating load and watching auto-scaling in action.  Everything is on Github, so pull requests will be warmly welcomed!  If you’ve tried to follow along but got stuck at all, please comment on this blog.

Join the DevOps Community Today!

As I’ve said in the past, if your organisation does not yet consider itself to be “doing DevOps” you should change that today.

If I was pushed to say the one thing I love most about the DevOps movement, it would be the sense of community and sharing.

I’ve never experienced anything like it previously in our industry.  It seems like everyone involved is united by being passionate about collaborating in as many ways as possible to improve:

  • the world through software
  • the rate at which we can do that
  • the lives of those working in our industry.

The barrier to entry to this community is extremely low.

You could also consider attending the DevOps Enterprise Summit London (DOES).  It’s the third DOES event and the first ever in Europe, and is highly likely to be one of the most important professional development things you do this year.  Organised by Gene Kim (co-author of The Phoenix Project) and IT Revolution, the conference is highly focused on bringing together anyone interested in DevOps and providing them with as much support as humanly possible over two days.  This involves presentations from some of the most advanced IT organisations in the world (aka unicorns), as well as many from those in traditional enterprises who may be on a very similar journey to you.  Already confirmed are talks from:

  • Rosalind Radcliffe talking about doing DevOps with Mainframe systems
  • Ron Van Kemenade CIO of ING Bank
  • Jason Cox about doing DevOps transformation at Disney
  • Scott Potter Head of New Engineering at News UK
  • And many more.

My recommendation is to get as many of your organisation along to the event as possible.  They won’t be disappointed.

Early bird tickets are available until 11th May 2016.

(Full disclosure – I’m a volunteer on the DOES London committee.)


Apple Music versus Continuous Delivery

Last week my phone received iOS update 8.4.1.  Being into Software Release Management, naturally I read the release notes.

On a slight tangent, I’m ashamed to say that when it comes to my phone, I don’t practice Continuous Deployment of app updates – even though I know it’s been possible since iOS 7. Instead, I like to also read those release notes before deploying (it’s not that bad – I don’t raise a Change Request or run a CAB!)

Anyway, the release notes explained that the reason for the update was an update to Apple Music (a new music streaming service aimed at challenging Spotify).  At this point my Release Management instincts were offended:

I have to update my whole Operating System in order to update just one application?

And it’s an application I don’t actually use.  (NB I don’t use it because I’m tied to Spotify through my phone contract, not because I’ve tried and/or rejected Apple Music.)  Even worse, this upgrade caused me downtime on my phone.

So I wonder:

How long will Apple continue to release the Operating System and the Music application as one monolithic build package?

Or, more interestingly, do they have other reasons for doing this?  (I notice they did bundle security updates.)

Spotify have a standalone application (which, thanks to the App Store, is capable of Continuous Deployment of updates – for those so inclined) and they release very regular updates.  I can’t see people being happy with taking iOS Operating System updates as frequently (unless they can start happening without an outage).

In my opinion, whilst this situation continues, Spotify possess a commercial advantage. Especially with all the great things we see and read about their agile culture.

Proposed Reference Architecture of a Platform Application (PaaA)

In this blog I’m going to propose a design for modelling a Platform Application as a series of generic layers.

I hope this model will be useful for anyone developing and operating a platform, in particular if they share my aspirations to treat the Platform as an Application and to:

Hold your Platform to the same engineering standards as you would (should?!) your other applications

This is my fourth blog in a series where I’ve been exploring treating our Platforms as an Application (PaaA). My idea is simple: whether you are re-using a third-party Platform Application (e.g. Cloud Foundry) or rolling your own, you should:

  • Make sure it is reproducible from version control
  • Make sure you test it before releasing changes
  • Make sure you release only known and tested and reproducible versions
  • Industrialise and build a Continuous Delivery pipeline for your platform application
  • Industrialise and build a Continuous Delivery pipeline within your platform.

As I’ve suggested, if we are treating Business Applications as Products, we should also treat our Platform Application as a Product.  With this approach in mind, clearly a Product Owner of a Business Application (e.g. a website) is not going to be particularly interested in the detail of how something like high-availability works.

A Platform Application should abstract the applications using it from many concerns which are important but not interesting to them.  

You could have a Product Owner for the whole Platform Application, but that’s a lot to think about, so I believe this reference architecture is a useful way to divide and conquer.  To further simplify things, I’ve defined this anatomy in layers, each of which abstracts the next layer from the underlying implementation.

So here it is:

PaaA_Anatomy

 

Starting from the bottom:

  • Hardware management 
    • Consists of: Hypervisor, Logical storage managers, Software defined network
    • The owner of this layer can make the call: “I’ll use this hardware”
    • Abstracts you from: the hardware and allows you to work logically with compute, storage and network resources
    • Meaning you can: consider hardware to be fully logical, within the limits of this layer (e.g. physical capacity or performance)
    • Presents to the next layer: the ability to work with logical infrastructure
  • Basic Infrastructure orchestration 
    • Consists of: Cloud console and API equivalent. See this layer as described in OpenStack here
    • The owner of this layer can make the call: “I will use these APIs to interact with the Hardware Management layer.”
    • Abstracts you from: having to manually track usage levels of compute and storage, and from monitoring the hardware
    • Meaning you can: perform operations on compute and storage in bulk using an API
    • Presents to the next layer: a convenient way to programmatically make bulk updates to what logical infrastructure has been provisioned
  • Platform Infrastructure orchestration (auto-scaling, resource usage optimisation)
    • Consists of: effectively a software application built to manage creation of the required infrastructure resources required. Holds the logic required for auto-scaling, auto-recovery and resource usage optimisation
    • The owner of this layer can make the call: “I need this many servers of that size, and this storage, and this network”
    • Abstracts you from: manually creating and scaling the required infrastructure, and from changing this over time in response to demand levels
    • Meaning you can: expect that enough logical infrastructure will always be available for use
    • Presents to the next layer: the required amount of logical infrastructure resources to meet the requirements of the platform
  • Execution architecture 
    • Consists of: operating systems, containers, and middleware e.g. Web Application Server, RDBMS
    • The owner of this layer can make the call: “This is how I will provide the runtime dependencies that the Business Application needs to operate”
    • Abstracts you from: the software and configuration required for your application to run
    • Meaning you can: know you have a resource that could receive release packages of code and run them
    • Presents to the next layer: the ability to create the software resources required to run the Business Applications
  • Logical environment separation
    • Consists of: logically separate and isolated instances of environments that can be used to host a whole application by providing the required infrastructure resources and runtime dependencies
    • The owner of this layer can make the call: “This is what an environment consists of in terms of different execution architecture components and this is the required logical infrastructure scale”
    • Abstracts you from: working out what you need to create fully separate environments
    • Meaning you can: create environments
    • Presents to the next layer: logical environments (aka Spaces) where code can be deployed
  • Deployment architecture
    • Consists of: the orchestration and automation tools required to release new Business Application versions to the Platform Application
    • The owner of this layer can make the call: “These are the tools I will use to deploy the application and configure it to work in the target logical environment”
    • Abstracts you from: the details about how to promote new versions of your application, static content, database and data
    • Meaning you can: release code to environments
    • Presents to the next layer: a user interface and API for releasing code
  • Security model
    • Consists of: a user directory, an authentication mechanism, an authorisation mechanism
    • The owner of this layer can make the call: “These authorised people can make the following changes to all layers down to Platform Infrastructure orchestration”
    • Abstracts you from: having to implement controls over platform use.
    • Meaning you can: empower the right people and be protected from the wrong people
    • Makes the call: “I want only authenticated and authorised users to be able to use my platform application”
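To make the “presents to the next layer” chain concrete, here is a small Python sketch of the stack as an ordered list (the one-line summaries are paraphrased from the bullets above; the security model is left out because it cuts across all layers rather than sitting between two of them):

```python
# Each tuple: (layer name, what it presents to the layer above), bottom-up.
layers = [
    ("Hardware management", "logical infrastructure"),
    ("Basic infrastructure orchestration", "bulk API over logical infrastructure"),
    ("Platform infrastructure orchestration", "right-sized logical infrastructure"),
    ("Execution architecture", "software resources to run Business Applications"),
    ("Logical environment separation", "logical environments (Spaces)"),
    ("Deployment architecture", "UI and API for releasing code"),
]

def describe_stack(layers):
    """Walk bottom-up, pairing each layer with the consumer of what it presents."""
    lines = []
    for i, (name, presents) in enumerate(layers):
        consumer = layers[i + 1][0] if i + 1 < len(layers) else "Business Applications"
        lines.append(f"{name} -> presents '{presents}' to {consumer}")
    return lines
```

Walking the list top to bottom reproduces the “presents to the next layer” bullets, ending with the Business Applications as the final consumer.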

I’d love to hear some feedback on this.  In the meantime, I’m planning to map some of the recent projects I’ve been involved with into this architecture to see how well they fit and what the challenges are.

PaaA is great for DevOps too: treat your Platform as a Product!

In this previous post, I chronicled my evolving understanding of PaaS and how it has taught me the virtues of treating your Platform as an Application (PaaA). Here I documented what I believe a self-respecting platform application should do.  In this post I’m going to describe how I’ve seen PaaA help solve the Dev and Ops “problem” in large organisations (“Traditional Enterprises” if you prefer).

DevOps is a highly used/abused term and here I’d like to define it as:

An organisational structure optimised for the fastest release of changes possible within a pre-defined level of acceptable risk associated with making changes. Or simply: the organisational structure that lets you release as fast as possible without losing control and messing up too badly.

This isn’t my first attempt at tackling DevOps teams, also see a blog here and an Ignite here. Of course lots of other good things have been written about it as well, e.g. here from Matt Skelton. I believe PaaA provides a good path.

So this is the traditional diagram for siloed Dev and Ops:

devops1

*Skull and crossbones denote issues which I won’t describe again here.  If you aren’t familiar with the standard story, I suggest viewing this excellent video by RackSpace.

For any organisation with more than one major application component (aka Product), when we add these to the diagram above it starts looking something like this:

devops2

Each application component (or Product), e.g. the website (Site) or the Content Management System (CMS), is affected by both silos. Traditionally the Development (Dev) silo write the code, whilst the Operations (Ops) silo use part-automated processes to release, host, operate, and do whatever necessary to keep the application in service.  Whilst each “Business Application” exists in both silos, only the Ops team have the pleasure of implementing and supporting “the Platform”, i.e. the infrastructure and middleware.

So if silos are bad, perhaps the solution is the following. One giant converged team:

devops3

The problem with this is scale. There is a high likelihood that attempting to adopt this in practice actually fails, and sub-teams quickly form within it to reinforce the original silos.

So we can look to subdivide this and make smaller combined Development and Operations teams per application component or small group of them. If that works for you, then fantastic!  This also is effectively the model you are already using when connecting your in-house application to any external or 3rd party web-services (for example Experian).

devops4

In my experience though, it is impractical and inappropriate to have so many different teams within one organisation each looking after their own Platform. Logically (as per the experience of public cloud) and physically (as per traditional data centres and private cloud) major elements of the platform are best to be shared e.g. for economies of scale, or perhaps for application performance.

So what about when you treat your Platform as an Application?  Where could Dev and Ops reside?

The optimum solution in my experience is a follows:

devopsPaaA

The Platform Application (highlighted above by a glowing yellow halo) has a dedicated and independent, fully-combined Development and Operations team and it is treated just like any other Business Application.

Hang on a minute, haven’t I just re-branded what would traditionally be just known as the Operations team as a Platform Application team?

Well no. Firstly, the traditional Development team usually has no Operations duties, such as following their code all the way to production and then being on call to support it once it is there. They may not feel accountable for instrumentation, monitoring and operability, perhaps not even performance.  Now they must consider all of these and implement them within the constraints of the capabilities provided by the Platform Application upon which they depend.  By default nothing will be provided for them; it is for them to consume from the Platform Application.  So the Platform Application team are already alleviated of a lot of accountability compared to a traditional Operations team. So long as they can prove the Platform Application is available and meeting service levels, their pagers will not bother them.

Secondly, the platform team are no longer quite so different from other end-to-end Business Application teams. They manage scope, they develop code, they manage dependencies, they measure quality, they can do Continuous Delivery and they must release their application just like anyone else.  Sure, their application is extremely critical in that everyone else (all the products using the platform instance) depends on them, but managing dependencies is very important between Business Applications as well, so this isn’t a new problem.

The Platform Application delivery team (which we could also call the Platform Product team) have to constantly recognise that their application has to provide a consistent experience to consuming Business Applications. One great technique for this (borrowed from “normal” applications) is Semantic Versioning (SemVer), where every change made has to be labelled to provide a meaningful depiction of the compatibility of the new version relative to the previous.  In Platform Application terms we can update the SemVer description as:

  1. MAJOR version when you expect consuming Business Applications to need changes, e.g. you change the RDBMS
  2. MINOR version when you don’t expect consuming Business Applications to break, but need full regression testing, e.g. configuration tuning or a security update
  3. PATCH version when you make backwards-compatible changes expected to have no or a very low chance of external impact.  For example, if the IaaS API has a change from which the Platform Application fully abstracts Business Applications.
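The three rules above can be captured in a few lines. This is a hedged sketch; the `bump` helper and its change categories are my own naming, not a published API:

```python
def bump(version, change):
    """Bump a MAJOR.MINOR.PATCH version string per the platform rules above.

    change: 'major' - consuming Business Applications need changes
            'minor' - no expected breakage, but full regression testing needed
            'patch' - backwards-compatible, no expected external impact
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```

So a change to the RDBMS would be `bump(current, "major")`, while a fully-abstracted IaaS API change is just `bump(current, "patch")`.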

Hopefully it is becoming clear how powerful and effective the mentality of treating your Platform as an Application (or Product) can be.  Everything that has been invented to help deliver normal applications can be re-used and/or adapted for Platform Applications.  The pattern is also extremely conducive to switching to a public PaaS; in fact, it is exactly how you operate when using one.

Full disclosure: I run an organisation that develops and manages multiple different Platform Applications for different enterprises.  I am most enthusiastic about this approach because I feel it reconciles a lot of the conventional wisdom around DevOps that I’ve heard about with what I’ve seen first-hand to be extremely successful in my job working in “traditional Enterprises”.

Reducing Continuous Delivery Impedance – Part 4: People

This is my fourth post in a series I’m writing about impedance to doing continuous delivery.  Click these links for earlier parts about the trials of Infrastructure, Solution Complexity and COTS products.

On the subject of Continuous Delivery, where the intention is to fail fast, it’s actually rather sloppy of me to have deferred talking about people until the fourth post in this series. When it comes to implementing Continuous Delivery there is nothing more potentially obstructive than people.  Or, to put things more positively, nothing can have a more positive impact than people!

Here are my top 4 reasons that people can cause impedance.

#1 Ignorance  A lack of understanding and appreciation of Continuous Delivery, even among a small but perhaps vocal or influential minority, can be a large source of impedance.  Many Developers and Operators (and a new species of cross-breeds!) have heard of Continuous Delivery and DevOps, but often Project Managers, Architects, Testers and Management/Leadership may not.  Continuous Delivery is like Agile in that it needs to be embraced by an organisation as a whole, simply because anyone in an organisation is capable of causing impedance through their actions and the decisions they make.  For example, the timelines set by a project manager simply may not support taking time to automate.  A software package selected by an architect could cause a lot of pain to everyone with an interest in automation.

A solution to this that I’ve seen work well has been awareness sessions.  Whatever format works best for sharing knowledge (brownbag lunches, webinars, communication sessions, memos, the pub, etc.) should be used to make people aware of what Continuous Delivery can do, how it works, why it is important, and what all the various terminology means.

I once spent a week doing this and talked to around 10 different projects in an organisation and hundreds of people.  It was a very rewarding experience and by the end of it we’d gone from knowing 1 or 2 interested people to scores.  It was also great to make connections with people already starting to do great things towards the cause.  We even created a social media group to share ideas and research.

#2 Ambivalence  As I’ve discussed before, some people reject Continuous Delivery because they see it as un-achievable and/or inappropriate for their organisation.  (Often I’ve seen this being due to confusion with Continuous Deployment.)  Also, don’t overlook a cultural aversion to automation.  In my experience it was only around 5 years ago that the majority of people “in charge” were still very skeptical about the concept of automating the full software stack, preferring more manual approaches.

A solution here (assuming you’ve revisited the awareness sessions where necessary) is to organise demos of any aspects of Continuous Delivery already adopted and demonstrate that it is real and already adding value.

#3 Obedience  Another source of impedance could perhaps be a misguided perception that Continuous Delivery is actually forbidden in a particular organisation.  So people will impede it due to a misinformed attempt at obedience to the management/leadership.  Perhaps a management steer to focus only tactically on “delivery, delivery, delivery” does not allow room for automation.  Or perhaps they take a very strong interest in how everything works and haven’t yet spoken about Continuous Delivery practices, or even oppose certain important techniques like Continuous Integration.  Or perhaps a leadership mandate to cut costs makes strategic tasks like automation seem frivolous or impossible.

A solution here is for management/leadership to publicly endorse Continuous Delivery and cite it as the core strategy/methodology for ongoing delivery.  Getting them along to the above-mentioned awareness sessions can help a lot, as can getting them to blog about it, or setting up demos with them to highlight the benefits of the automation already developed.  Working Continuous Delivery into the recognition and rewards processes could also be effective (C-suite, take note!).

#4 Disobedience  Finally, if people know what Continuous Delivery is, they want it, and they know they are allowed it, why would they then disobey and not do it?  Firstly, it could be down to other sources of impedance that make it difficult even for the most determined (e.g. Infrastructure).  But it could also easily be a lack of time, resources, budget or skills.

Skills are relatively easy to address so long as you make time.  Depending on where you live there could be masses of good MeetUps to go and learn at.  There are superb tutorials online for all of the open source tools.  Freenode is packed with good IRC channels of supportive individuals.  The list goes on.

Another thing to consider here is governance.  As I’ve confessed before, some people (like me) really enjoy things like pipeline orchestration, configuration management and automated deployments. But this is not the norm.  It is very common for such concerns to be unloved and to slip through the cracks with no-one feeling accountable.  Making sure there is a clear owner for each of these is a very important step.  Personally, I am always more than happy to take on this accountability on projects, as opposed to seeing these concerns sit unloved and ignored.

Finally, as I’ve said before, DevOps discussions often focus around the idea that an organisation has just two silos – Development and Operations.  But in my experience, things are usually a lot more complex, with multiple silos (perhaps by technology, release, department, etc.), multiple vendors, multiple suppliers, you name it.  Putting a DevOps team in place to help get started towards Continuous Delivery can be one effective way of ensuring there is ownership, dedicated focus and skills ready to work with others to overcome people impedance.  Of course, heed the warnings.

Obviously overall People Impedance is a huge subject.  I hope this has been of some use.  Please let me know your own experiences.