How my team do root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems, for example from Etsy, and by the Beyond Blame book.  Hope you find it useful.

RCA Approach


This page describes a 7-step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice it is often a cocktail of contributing causes as well as negative (and sometimes positive) outcomes.
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions, not just to solve a mystery or, worse, to find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. they consciously carried out actions that they knew would cause this or something of similar negative consequences or they clearly understood the risks but cared so little that they weren’t deterred?


Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions.  If it is “NO”, we can now proceed in a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answer is “YES”, this is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could do to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need to go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could do to validate them and prioritise the discovery.
  3. If anyone feels uncomfortable during the process, they must raise it immediately.  This includes discomfort due to:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability to have their voice heard
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.
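
Rules 1 and 2 are easier to follow if every assertion is captured in the same shape.  Here is a minimal Python sketch of how that could look.  The names `Fact`, `Confidence` and `facts_to_validate` are my own inventions for illustration, and the extra `VALIDATED` level is an assumption (the rules above only mention High, Medium and Low).

```python
from dataclasses import dataclass, field
from enum import Enum


class Confidence(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"
    # Assumption: an extra level for facts we have fully confirmed.
    VALIDATED = "Validated"


@dataclass
class Fact:
    statement: str
    confidence: Confidence
    # Rule 1: the actions we could take to validate this fact.
    validation_actions: list[str] = field(default_factory=list)


def facts_to_validate(facts: list[Fact]) -> list[Fact]:
    """Return facts that still need validating, lowest confidence first."""
    order = {Confidence.LOW: 0, Confidence.MEDIUM: 1, Confidence.HIGH: 2}
    pending = [f for f in facts if f.confidence is not Confidence.VALIDATED]
    return sorted(pending, key=lambda f: order[f.confidence])
```

Sorting by confidence is only one possible way to prioritise the discovery work; your team may prefer to prioritise by how much of the analysis depends on each fact.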

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion, attempt to reach a consensus on a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people, e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and identify and underline every aspect of the statement that someone could ask “Why” about.  Try to take an outsider’s view: even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into enough more specific “Why” questions so that your answers can be correlated directly.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer at least 5 “Why”s per branch, you are possibly taking steps that are too large.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement where if necessary we should have implemented measures to protect ourselves from the third party.
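
The stopping rules in this step can be pictured as a tree of “Why” answers with a few checks on it.  The following is a rough Python sketch, not a tool we actually use.  The names `Why`, `branches` and `review` are hypothetical, and spotting “human error” by matching text is a crude stand-in for what should really be a group judgement.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    question: str
    answer: str
    # Each underlined aspect of the answer becomes a follow-up "Why".
    children: list["Why"] = field(default_factory=list)


def branches(node, path=()):
    """Yield every root-to-leaf chain of Why nodes."""
    path = path + (node,)
    if not node.children:
        yield path
    else:
        for child in node.children:
            yield from branches(child, path)


# Assumption: simple text markers for endings we must not stop on.
STOP_PHRASES = ("human error", "third party error")


def review(roots, min_branches=3, min_depth=5):
    """Return a list of reasons why the analysis is not finished yet."""
    problems = []
    all_branches = [b for root in roots for b in branches(root)]
    if len(all_branches) < min_branches:
        problems.append(
            f"only {len(all_branches)} branch(es); ask more fundamental Whys"
        )
    for branch in all_branches:
        leaf = branch[-1]
        if len(branch) < min_depth:
            problems.append(
                f"branch ending '{leaf.answer}' has {len(branch)} answer(s); "
                "the steps may be too large"
            )
        if any(phrase in leaf.answer.lower() for phrase in STOP_PHRASES):
            problems.append(f"branch ends on '{leaf.answer}'; keep asking Why")
    return problems
```

An empty list from `review` only means the tree has the minimum shape described above.  It says nothing about whether the answers themselves are good.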

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and make hypotheses about actions that could be taken to prevent future re-occurrences. Like all good hypotheses, these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria and don’t close the actions unless they satisfy the tests that they have been effective.
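
To make “specific and testable” concrete, here is one possible way to record a countermeasure so that it cannot be closed without a passing test.  This is a Python sketch under my own assumptions: `Countermeasure` and `can_close` are hypothetical names, and in practice the acceptance test would live in whatever backlog tool you already use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Countermeasure:
    end_point: str  # the end point of the analysis tree that this addresses
    action: str     # the specific change we propose to make
    # How we will know the action was effective; None means not yet testable.
    acceptance_test: Optional[Callable[[], bool]] = None

    def can_close(self) -> bool:
        """Only closable when a test exists and that test passes."""
        return self.acceptance_test is not None and self.acceptance_test()
```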


Reducing Continuous Delivery Impedance – Part 5: Learned Helplessness

Nearly two years ago, I started this blog series to describe the main challenges I’d experienced trying to implement Continuous Delivery.  At the time, the last post in the series was about four challenges related to people.  Since then I’ve observed a fifth challenge and discovered it has been studied in psychology and has a name.

In this post I’ll attempt to describe how to recognise and tackle Learned Helplessness.  Please share your comments (especially if my Psychology-by-Wikipedia needs guidance).

Through various interactions with clients, at meetups, conferences and even with my own team, I’ve witnessed the following phenomena:

  • Something is done (or not done) on an engagement that makes Continuous Delivery difficult (for example the development team accepting SonarQube saying some seriously defamatory things about their unit test coverage but neglecting even to gradually address this).
  • When questioned:
    • many people already appreciate that this is very wrong.
    • hardly anyone can really explain or justify why this is happening.
    • hardly anyone seems worked up about a solution.

It gave me an impression that people had experienced good practice in the past, but having joined this particular engagement had somehow lost the inclination to do it.  It’s possible that for some people, in the past when things just worked, they didn’t question it, so never really appreciated the value of particular practices.  But I think most people are more analytical than that.  I started to realise that people probably had gone through an experience like this:

  • Joined the engagement, didn’t understand why certain things were / weren’t done, but opted to observe before speaking up.
  • Realised things actually weren’t magically working in some new logic- / experience- defying way.
  • Spoke up but didn’t really get listened to.
  • Spoke up again several times, but didn’t really ever get listened to.
  • Gave up and accepted things for the sorry way that they are.

I figured there must be a name for this, started googling and realised it is called Learned Helplessness, something that was first demonstrated experimentally in the 1960s by some scientists we can probably assume weren’t dog lovers…

The experiments are best described here on Wikipedia but in extremely simplified form:

  1. some dogs were given no random electric shocks,
  2. some dogs were given shocks and also given a button to press to disable the shocks,
  3. some dogs received shocks at the same time as group 2 dogs but had no button.  Group 3 dogs were paired with Group 2 dogs and were shocked until their Group 2 pair happened to press the button (which was at a random time from the Group 3 dog’s perspective).

The learned helplessness of Group 3 was demonstrated in the second part of the experiments when dogs had the opportunity to cross over a small wall to avoid getting shocks.  Whereas groups 1 and 2 quickly learned how to avoid shocks, group 3 all failed to learn and sat there accepting their fate in pain.

The similarity of the above diagram to diagrams about DevOps like this made me smile!

Subsequent experiments demonstrated the ineffectiveness of threats or even rewards on motivating group 3 to change their location.  Only by physically teaching the group 3 dogs to move more than twice did they learn to overcome the helplessness.  Later experiments also demonstrated the same phenomenon in humans (without electricity).

So how do we overcome this?

Here are some things I’m experimenting with:

  • Try some introspection – ask yourself what you’ve learnt to accept, really look around for things that are stopping your project going faster – no matter how obvious, and start to ask why, perhaps at least 5 times.
  • Ask others around you, ideally at all levels of experience (less than, the same as and more than you), what they think is preventing learning and improvement, and consider asking “5 Whys” with them.
  • Pay close attention to new joiners to your team – they are the only ones not yet infected by Learned Helplessness.
  • Be sensitive with people.  No-one wants to be told they are “helpless” or hear your amateur psychobabble.  Tread carefully.
  • If you are looking to impart a change, don’t overestimate the impact of threatening or incentivising the people who need to change – they may already be too apathetic.  Instead expect to need to show them multiple times:
    • That the proposed change is possible.  You need to demonstrate it to them (for example if it relates to Continuous Delivery something like the DevOps Platform may help make things real).
    • That their opinions count and they have an important voice.

How is Learned Helplessness harming your organisation and to what extent are you suffering?


Running the DevOps Platform on Microsoft Azure

As per my last post about GCE, sometimes knowing something is possible just isn’t good enough.  So here is how I spun up the DevOps Platform on the Microsoft Azure cloud.  Warning: thanks to Docker Machine, this post is very similar to this earlier one.

1. I needed an Azure account.

2. I logged into my Azure account and didn’t click “view the new Portal”.

3. On the left hand menu, I scrolled down to the bottom (it didn’t immediately look to me like it would scroll, so hover) and clicked settings.  Here I was able to see my subscription ID and copy it.

4. (Having previously installed Docker Toolbox, see here) I opened Git Bash (as an Administrator) and ran this command:

$ docker-machine create --driver azure --azure-size Standard_A3 --azure-subscription-id <the ID I just copied> markos01

I was prompted to open a URL in my browser, enter a confirmation code, and then log in with my Azure credentials.  Credit to Microsoft, this was easier than GCE, for which I needed to install the gcloud command line utility!

You will notice that this is fairly standard.  I picked a Standard_A3 machine type, which is roughly equivalent to what we use for AWS and GCP.

5. I waited while a machine was created in Azure containing Docker.

6. I cloned the ADOP Docker Compose repository from GitHub:

$ git clone
$ cd adop-docker-compose

7. I ran the normal command as follows:

$ ./ -m markos01 -c NA

And entered a user name (thanks to this recent enhancement). Hey presto:

SUCCESS, your new ADOP instance is ready!
Run these commands in your shell:
eval \"$(docker-machine env $MACHINE_NAME)\"
Navigate to in your browser to use your new DevOps Platform!

And just to prove it:

$ whois | grep Org
Organization: Microsoft Corporation (MSFT)
OrgName: Microsoft Corporation

8. I had to go to All resources > markos01-firewall > Inbound security rules and add a rule to allow HTTP to my server on port 80.

9. I viewed my new Azure-hosted ADOP instance in (of course…) Chrome! 😉

More lovely stuff!


Start Infrastructure Coding Today!

* Warning this post contains mildly anti-Windows sentiments *

It has never been easier to get ‘hands-on’ with Infrastructure Coding and Containers (yes including Docker), even if your daily life is spent using a Windows work laptop.  My friend Kumar and I proved this the other Saturday night in just one hour in a bar in Chennai.  Here are the steps we performed on his laptop.  I encourage you to do the same (with an optional side order of Kingfisher Ultra).


  1. We installed Docker Toolbox.
    It turns out this is an extremely fruitful first step as it gives you:

    1. Git (and in particular GitBash). This allows you to use the world’s best Software Configuration Management tool Git and welcomes you into the world of being able to use and contribute to Open Source software on GitHub.  Plus it has the added bonus of turning your laptop into something which understands good wholesome Linux commands.
    2. Virtual Box. This is a hypervisor that turns your laptop from being one machine running one Operating System (Windoze) into something capable of running multiple virtual machines with almost any Operating System you want (even UniKernels!).  Suddenly you can run (and develop) local copies of servers that from a software perspective match Production.
    3. Docker Machine. This is a command line utility that will create virtual machines for running Docker on.  It can do this either locally on your shiny new Virtual Box instance or remotely in the cloud (even the Azure cloud – Linux machines of course).
    4. Docker command line. This is the main command line utility of Docker.  This will enable you to download and build Docker images, and turn them into running Docker containers.  The beauty of the Docker command line is that you can run it locally (ideally in GitBash) on your local machine and have it control Docker running on a Linux machine.  See diagram below.
    5. Docker Compose. This is a utility that gives you the ability to run and associate multiple Docker containers by reading what is required from a text file.
    [Diagram: DockerVB]
  2. Having completed step 1, we opened up the Docker Quickstart Terminal by clicking the entry that had appeared in the Windows start menu. This runs a shell script via GitBash that performs the following:
    1. Creates a virtual box machine (called ‘default’) and starts it
    2. Installs Docker on the new virtual machine
    3. Leaves you with a GitBash window open that has the necessary environment variables set to point the Docker command line utility at your new virtual machine.
  3. We wanted to test things out, so we ran:
    $ docker ps -a


    This showed us that our Docker command line tool was successfully talking to the Docker daemon (process) running on the ‘default’ virtual machine. And it showed us that no containers were either running or stopped on there.

  4. We wanted to test things a little further so ran:
    $ docker run hello-world
    Hello from Docker.
    This message shows that your installation appears to be working correctly.
    To generate this message, Docker took the following steps:
    The Docker client contacted the Docker daemon.
    The Docker daemon pulled the "hello-world" image from the Docker Hub.
    The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
    The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.
    To try something more ambitious, you can run an Ubuntu container with:
    $ docker run -it ubuntu bash
    Share images, automate workflows, and more with a free Docker Hub account:
    For more examples and ideas, visit:


    The output is very self-explanatory.  So I recommend reading it now.

  5. We followed the instructions above to run a container from the Ubuntu image.  This started a container running Ubuntu for us, and we ran a command to satisfy ourselves that we were running Ubuntu.  Note one slight modification: we had to prefix the command with ‘winpty’ to work around a tty-related issue in GitBash.
    $ winpty docker run -it ubuntu bash
    root@2af72758e8a9:/# apt-get -v | head -1
    apt 1.0.1ubuntu2 for amd64 compiled on Aug  1 2015 19:20:48
    root@2af72758e8a9:/# exit
    $ exit


  6. We wanted to run something else, so we ran:
    $ docker run -d -P nginx:latest


  7. This caused the Docker command line to do more or less what is stated in the previous step with a few exceptions.
    • The -d flag caused the container to run in the background (we didn’t need -it).
    • The -P flag caused Docker to publish the ports that the Nginx image exposes on the ‘default’ virtual machine, so that we could reach them from our Windows machine.
    • The Image was Nginx rather than Ubuntu.  We didn’t need to specify a command for the container to run after starting (leaving it to run its default command).
  8. We then ran the following to establish how to connect to our Nginx:
    $ docker-machine ip default
    $ docker ps
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                           NAMES
    826827727fbf        nginx:latest        "nginx -g 'daemon off"   14 minutes ago      Up 14 minutes>80/tcp,>443/tcp   ecstatic_einstein


  9. We opened a proper web browser (Chrome) and navigated to: using the information above (your IP address may differ). Pleasingly we were presented with the: ‘Welcome to nginx!’ default page.
  10. We decided to clean up some of what we’d created locally on the virtual machine, so we ran the following to:
    1. Stop the Nginx container
    2. Delete the stopped containers
    3. Demonstrate that we still had the Docker ‘images’ downloaded


    $ docker kill `docker ps -q`
    $ docker rm `docker ps -aq`
    $ docker ps -a
    CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
    $ docker images
    REPOSITORY                     TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
    nginx                          latest              sha256:99e9a        4 weeks ago         134.5 MB
    ubuntu                         latest              sha256:3876b        5 weeks ago         187.9 MB
    hello-world                    latest              sha256:690ed        4 months ago        960 B



  11. We went back to Chrome and hit refresh. As expected Nginx was gone.
  12. We opened Oracle VM Virtual box from the Windows start menu so that we could observe our ‘default’ machine listed as running.
  13. We ran the following to stop our ‘default’ machine and also observed it then stopping in Virtual Box:
    $ docker-machine stop default


  14. Finally we installed Vagrant. This is essentially a much more generic version of Docker-Machine that is capable of creating not just virtual machines in Virtual Box for Docker, but for many other purposes.  For example from an Infrastructure Coding perspective, you might run a virtual machine for developing Chef code.


Not bad for one hour on hotel wifi!

Kumar keenly agreed he would complete the following next steps.  I hope you’ll join him on the journey and Start Infrastructure Coding Today!

  1. Learn Git. It really only takes 10 minutes with this tutorial LINK to learn the basics.
  2. Docker – continue the journey here
  3. Vagrant
  4. Chef
  5. Ansible


Please share any issues following this and I’ll improve the instructions.  Please share any other useful tutorials and I will add those also.

Neither Carrot Nor Stick

Often when we talk about motivating people, the idiom of having the choice of using a Carrot or a Stick is used. I believe this originates from the conventional wisdom about the best ways to get a mule (as in the four legged horse-like animal) to move. You could try using a carrot, which might be enough of a treat for the mule to move in order to reach it. Or you could try a stick, which might be enough of a threat to get the mule to move in order to avoid being hit.

The idiom works because the carrot is analogous to offering someone an incentive (such as pay rises or bonuses) to get them to do something. The stick is analogous to offering them the threat of punishment (such as being fired or demoted). It’s curious how threat and treat differ by just one letter…

This all makes sense for a horse but not really for people.

The idiom has a major flaw because humans are significantly more complex than animals (all of us!).

Instead, if we want to influence someone effectively and sustainably, we need to think about how to help them to have an emotional attachment to the thing you are looking to achieve.

I think this comes down to the following:

  • Being open to exploring both their and your personal motivations with a view to maximising the achievement of both – in particular the overlap.
  • Starting from an open mind and only looking to agree the desired outcome. This is not the same as agreeing the approach. The approach is key to the satisfaction and motivation of the implementer and key to their attachment to achieving a great solution.
  • Supporting them in their chosen approach taking care not to challenge unnecessarily or do anything that risks eroding their sense of your trust.
  • Being transparent about the consequences of not delivering the desired outcome and clarifying your own role in shielding them from blame and creating a safe environment to operate.

Of course these ideas are not my own.  I would encourage you to explore some of these great materials that I have taken inspiration from:

And I’d love to hear your own ideas and recommended reading.


New Directions in Operating Systems: Designer Cows and Intensive Farming

On Tuesday I attended the inaugural New Directions in Operating Systems conference in London which was excellently organised by Justin Cormack, and sponsored by :ByteMark (no relation to me!) and RedHat.

I attended with the attitude that since things seem to be moving so fast in this space (e.g. Docker), there would be a high chance that I would get glimpses of the not-too-distant future.  I was not disappointed.

I’m not going to cover every talk in detail. For that, I recommend reading this which someone from Cambridge somehow managed to live blog.  Also most of the presentation links are now up here and I expect videos will follow.

Instead, here are two main highlights that I took away.

Designer Cows

If we are aspiring to treat our servers as cattle (as per the popular metaphor from CERN), a number of talks were (to me) about how to make better cows.

The foundation of all solutions was unanimously either bare metal or a Xen hypervisor.  As per the name, this wasn’t the conference for talking about the Open Compute Project or advances like AWS have made recently with C4.  We can think about the hypervisor as our “field” in the cow metaphor.

For the sake of my vegetarian and Indian friends, let’s say the reason for owning cows is to get milk.  Cows (like the servers we use today) have evolved significantly (under our influence) and cows are now very good at producing milk.  But they still have a lot of original “features” not all of which may directly serve our requirements.  For example they can run, they can moo, they can hear etc.

A parallel can be drawn to a web server which similarly may possess its own “redundant features”.  Most web servers can talk the language required by printers, support multiple locales and languages, and can be connected to using a number of different protocols.  There could even be redundancy in traditional “core features” for example supporting multiple users, multiple threads, or even virtual memory.

The downside of all of this redundancy is not just efficiency of storage, processing and maintenance, it’s also the impact on security.  All good sysadmins will understand that unnecessary daemons running on a box (e.g. hopefully even sshd) expand your attack surface by exposing additional attack vectors.  This principle can be taken further.  As Antti Kantee said, drivers can add millions of lines of code to your server.  Every one of these presents the potential for a security defect.

Robert Watson was amongst others to quote Bruce Schneier:

Defenders have to protect against every possible vulnerability, but an attacker only has to find one security flaw to compromise the whole system.

With HeartBleed, Shellshock and Poodle all in the last few months, this clearly needs some serious attention (for the good of all of us!).

To address this we saw demonstrations of:

  • Rump Kernels, which are stripped down to include only filesystems, TCP/IP, device calls, system calls and device drivers, but no threads, locking, scheduling etc.  So this is more of a basis from which to build a cow but not a working one.
  • Unikernels, where all software layers are from the same language framework.  So this is building a whole working cow from bespoke parts with no redundant parts (ears etc!).
  • RumpRun, for building Unikernels for running a POSIX application (mathopd, an HTTP server, in the example), i.e. taking a Rump Kernel and building it into a single-application kernel for one single job.  So another way to build a bespoke cow.
  • MirageOS, a library operating system (written in OCaml) for building type-safe Unikernels.  So another way to build a bespoke very safe cow.
  • GenodeOS, a completely new operating system taking a new approach to delegating trust through the application stack.  So to some extent a new animal that produces milk with a completely re-conceived anatomy.

Use cases range from the “traditional” like building normal (but much more secure) servers to completely novel such as very light-weight and short-lived servers to start up almost instantaneously, do something quickly and disappear.

Docker and CoreOS, with its multi virtual machine-aware and now stripped down container-ready functionality, were mentioned.  However, whilst CoreOS has a smaller attack surface, if you are running a potentially quite big Docker container on it, you may be adding back lots of attack vectors.  Possibly as the dependency resolution algorithm for Docker improves, this will progressively reduce the size of Docker containers and hence the number of lines of code and potential vulnerabilities included.

Intensive Farming

Two presentations of the day stood out for generally focussing on a different (and to me more recognisable) level of problem.  First was Gareth Rushgrove’s about Configuration Management.  He covered a very wide range of concepts and tools focused on the management of the configuration of single and fleets of servers over time rather than novel ways to construct operating systems.  He made the statement:

If servers are cattle not pets, we need to talk about fields and farms

Which inspired the title of this blog and led to some discussion (during the presentation and on Twitter) about using active runtime Configuration Management tools like Puppet to manage the adding and removing of infrastructure resources over time.  Even if most of your servers are immutable, it’s quite appealing to think Puppet or Chef could manage both what servers exist and the state of those more pet-like creatures that do change over time (in the applications that most good “horse” organisations run).

Whilst AWS CloudFormation can provision and update your environment (a so-called Stack), resultant changes may be heavy-handed and this is clearly a single cloud provider solution.  Terraform is a multi cloud provider solution to consider and supports a good preview mode, but doesn’t evolve the configuration on your servers over time.

Gareth also mentioned:

  • OSv, which at first I thought was just an operating system query engine like Facebook’s osquery.  But it appears to be a fully API driven operating system.
  • Atomic and OSTree, which Michael Scherer covered in the next talk. These look like very interesting solutions for providing confidence and integrity in those bits of the application and operating system that aren’t controlled by Chef, Puppet or a Dockerfile.

I really feel like I’ve barely done justice to describing even 20% of this excellent conference.  Look out for the videos and look out for the next event.

No animals were harmed during the making of the conference or this blog.

Proposed Reference Architecture of a Platform Application (PaaA)

In this blog I’m going to propose a design for modelling a Platform Application as a series of generic layers.

I hope this model will be useful for anyone developing and operating a platform, in particular if they share my aspirations to treat the Platform as an Application and to:

Hold your Platform to the same engineering standards as you would (should?!) your other applications

This is my fourth blog in a series where I’ve been exploring treating our Platforms as an Application (PaaA). My idea is simple: whether you are re-using a third-party Platform Application (e.g. Cloud Foundry) or rolling your own, you should:

  • Make sure it is reproducible from version control
  • Make sure you test it before releasing changes
  • Make sure you release only known and tested and reproducible versions
  • Industrialise and build a Continuous Delivery pipeline for your platform application
  • Industrialise and build a Continuous Delivery pipeline within your platform.

As I’ve suggested, if we are treating Business Applications as Products, we should also treat our Platform Application as a Product.  With this approach in mind, clearly a Product Owner of a Business Application (e.g. a website) is not going to be particularly interested in detail about how something like high-availability works.

A Platform Application should abstract the applications using it from many concerns which are important but not interesting to them.  

You could have a Product owner for the whole Platform Application, but that’s a lot to think about so I believe this reference architecture is a useful way to divide and conquer.  To further simplify things, I’ve defined this anatomy in layers, each of which abstracts the next layer from the underlying implementation.

So here it is:



Starting from the bottom:

  • Hardware management 
    • Consists of: Hypervisor, Logical storage managers, Software defined network
    • The owner of this layer can make the call: “I’ll use this hardware”
    • Abstracts you from: the hardware and allows you to work logically with compute, storage and network resources
    • Meaning you can: within the limits of this layer (e.g. physical capacity or performance), consider hardware to be fully logical
    • Presents to the next layer: the ability to work with logical infrastructure
  • Basic Infrastructure orchestration 
    • Consists of: Cloud console and API equivalent. See this layer as described in OpenStack here
    • The owner of this layer can make the call: “I will use these APIs to interact with the Hardware Management layer.”
    • Abstracts you from: having to manually track usage levels of compute and storage, and having to monitor the hardware
    • Meaning you can: perform operations on compute and storage in bulk using an API
    • Presents to the next layer: a convenient way to programmatically make bulk updates to what logical infrastructure has been provisioned
  • Platform Infrastructure orchestration (auto-scaling, resource usage optimisation)
    • Consists of: effectively a software application built to manage creation of the required infrastructure resources. Holds the logic required for auto-scaling, auto-recovery and resource usage optimisation
    • The owner of this layer can make the call: “I need this many servers of that size, and this storage, and this network”
    • Abstracts you from: manually creating and scaling the required infrastructure and from changing this over time in response to demand levels
    • Meaning you can: expect that enough logical infrastructure will always be available for use
    • Presents to the next layer: the required amount of logical infrastructure resources to meet the requirements of the platform
  • Execution architecture 
    • Consists of: operating systems, containers, and middleware e.g. Web Application Server, RDBMS
    • The owner of this layer can make the call: “This is how I will provide the runtime dependencies that the Business Application needs to operate”
    • Abstracts you from: the software and configuration required for your application to run
    • Meaning you can: know you have a resource that could receive release packages of code and run them
    • Presents to the next layer: the ability to create the software resources required to run the Business Applications
  • Logical environment separation
    • Consists of: logically separate and isolated instances of environments that can be used to host a whole application by providing the required infrastructure resources and runtime dependencies
    • The owner of this layer can make the call: “This is what an environment consists of in terms of different execution architecture components and this is the required logical infrastructure scale”
    • Abstracts you from: working out what you need to create fully separate environments
    • Meaning you can: create environments
    • Presents to the next layer: logical environments (aka Spaces) where code can be deployed
  • Deployment architecture
    • Consists of: the orchestration and automation tools required to release new Business Application releases to the Platform Application
    • The owner of this layer can make the call: “These are the tools I will use to deploy the application and configure it to work in the target logical environment”
    • Abstracts you from: the details about how to promote new versions of your application, static content, database and data
    • Meaning you can: release code to environments
    • Presents to the next layer: a user interface and API for releasing code
  • Security model
    • Consists of: a user directory, an authentication mechanism, an authorisation mechanism
    • The owner of this layer can make the call: “These authorised people can make the following changes to all layers down to Platform Infrastructure orchestration”
    • Abstracts you from: having to implement controls over platform use.
    • Meaning you can: empower the right people and be protected from the wrong people
    • Makes the call: “I want only authenticated and authorised users to be able to use my platform application”
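
For anyone who prefers to see the layering as data, here is a small Python sketch of the stack described above.  It is only an illustration: `Layer`, `LAYERS` and `builds_on` are names I have made up, the “presents” text is condensed from the bullets, and the entry for the Security model paraphrases its “Makes the call” line because that layer has no “Presents to the next layer” bullet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    presents: str  # what this layer offers to the layer above it


# Bottom-up order, matching the list above.
LAYERS = [
    Layer("Hardware management", "logical infrastructure"),
    Layer("Basic Infrastructure orchestration",
          "programmatic bulk updates to logical infrastructure"),
    Layer("Platform Infrastructure orchestration",
          "the required amount of logical infrastructure"),
    Layer("Execution architecture",
          "software resources to run the Business Applications"),
    Layer("Logical environment separation",
          "logical environments where code can be deployed"),
    Layer("Deployment architecture",
          "a user interface and API for releasing code"),
    # Paraphrased from this layer's "Makes the call" bullet.
    Layer("Security model",
          "use of the platform by authenticated and authorised users only"),
]


def builds_on(name: str) -> list[str]:
    """Names of the layers beneath the given layer, nearest first."""
    index = [layer.name for layer in LAYERS].index(name)
    return [layer.name for layer in reversed(LAYERS[:index])]
```

Calling `builds_on("Execution architecture")` lists the three infrastructure layers that the owner of that layer should not need to think about.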

I’d love to hear some feedback on this.  In the meantime, I’m planning to map some of the recent projects I’ve been involved with into this architecture to see how well they fit and what the challenges are.