Transforming the future together

We have chosen “Transforming the future together” as the theme for the DevOpsDays London 2017 Conference.  I wanted to share some personal thoughts here about what it means.  These may not be the views of the other organisers (including Bob Walker, who proposed the theme).  If you are reading this before 31st May 2017, there is still time for you to submit a proposal to talk!

One of the things I find most inspiring about the DevOps community is the level of sharing, not just in terms of Open Source software, but in terms of strong views on culture, processes, metrics and usage of technology.  I am especially excited by the theme because of the potential for expanding what we share.

ArcaBoard_Dumitru_Popescu

If we are going to transform the future I think to an extent we have to hold an interpretation of:

  • where we are today,
  • where the future might be heading without our intervention,
  • how we would like it to be, and
  • what we might do to influence it.

Even standing alone I can imagine each of these having lots of potential.

The great thing about hearing about the experiences of others is that it can be a shortcut to discovering things without having to try them yourself.  For example, you are walking in the country and you pass someone covered in mud.  “The middle of this track is much more slippery than it looks”, they volunteer with a pained smile.  Whether you choose to take the side of the path or test your own ability to brave the slippery part, you are much better off for the experience.

I hope to see some talks at DevOpsDays London that attempt to give a balanced, general description of where the speakers are today as technology organisations: not just how many times a day they can change static wording on their website, but also how fast they can change, for example, their Oracle packaged products, and what their bottlenecks are.

“The future is already here — it’s just not very evenly distributed.”  William Gibson

Naturally everyone is at a different stage of their DevOps journey.  I’m hoping there will be a good range of talks reflecting this variety.

I would love to hear some predictions, optimistic and pessimistic, about where we might be and where we should be heading.  How will the world change when we cross 50% of people being online?  Are robots really coming for our jobs or our lives?  Is blockchain really going to find widespread usage?  Will the private data centre be marginalised as much as private electricity generators have been?  Will Platform-as-a-Service finally gain widespread traction, or will it be eclipsed by ServerLess?  Will 100% remote working liberate more of us from living near expensive cities?  Will the debate about what is and isn’t Agile ever end?  What is coming after Micro Services?  Will the gender balance in our industry become more equal?  Can holacracies succeed as a form of organisation?  These questions are of course just the start.

Finally, what are people doing now towards achieving some of the above?  What technologies are really starting to change the game?  What new ways of working and techniques are really making an impact (positive or negative)?  How are they working to be more inclusive and improve diversity?  Are they actually liking bi-modal IT?  What are they doing to alleviate their bottlenecks?

Please share your own questions that you would like to see answered and once again please consider submitting a proposal.

Image: https://en.wikipedia.org/wiki/File:ArcaBoard_Dumitru_Popescu.jpg

How Psychologically Safe does your team feel?

As per this article, Google conducted a two-year long study into what makes their best teams great and found psychological safety to be the most important factor.

As per Wikipedia, psychological safety can be defined as:

“feeling able to show and employ one’s self without fear of negative consequences of self-image, status or career”

It certainly seems logical to me that creating a safe working environment, where people are free to share their individual opinions and avoid groupthink, is highly important.

So the key question is how can you foster psychological safety?

Some of the best advice I’ve read was from this blog by Steven M Smith.  He suggests performing a paper-based voting exercise to measure safety.

Whilst we’ve found this to be a good technique, the act of screwing up bits of paper is tedious and hard to do remotely.  Hence we’ve created an online tool:

https://safetychecker.herokuapp.com/

Please give it a go and share your experiences!

How my team do root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems for example from Etsy and the Beyond Blame book.  Hope you find it useful.

RCA Approach

 

This page describes a 7 step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice it is often a cocktail of contributing causes as well as negative (and sometimes positive) outcomes
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions.  Not just to solve a mystery or worse find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. they consciously carried out actions that they knew would cause this or something of similar negative consequences or they clearly understood the risks but cared so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions.  If it is “NO”, we can now proceed in a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answer is “YES”, this is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could do to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need to go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could do to validate them and prioritise the discovery.
  3. If anyone feels uncomfortable during the process due to any of the following, they must raise it immediately:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability to have their voice heard
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion attempt to reach a consensus over a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and underline every aspect of it that someone could ask “Why” about.  Try to take an outsider’s view: even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into enough more specific “Why” questions so that your answers can be correlated directly.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer more than 5 “Why”s per branch, you are possibly taking steps that are too large.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement where if necessary we should have implemented measures to protect ourselves from the third party.

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and make hypotheses about actions that could be taken to prevent future re-occurrences. Like all good hypotheses, these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria and don’t close the actions unless they satisfy the tests that they have been effective.

 

Using ADOP and Docker to Learn Ansible

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.  Through the concept of cartridges (plugins) ADOP also makes it very easy to re-use automation.

In this blog I will describe an ADOP Cartridge that I created as an easy way to experiment with Ansible.  Of course there are many other ways of experimenting with Ansible such as using Vagrant.  I chose to create an ADOP cartridge because ADOP is so easy to provision and predictable.  If you have an ADOP instance running you will be able to experience Ansible doing various interesting things in under 15 minutes.

To try this for yourself:

  1. Spin up an ADOP instance
  2. Load the Ansible 101 Cartridge (instructions)
  3. Run the jobs one-by-one and in each case read the console output.
  4. Re-run the jobs with different input parameters.

To anyone only loosely familiar with ADOP, Docker and Ansible, I recognise that this blog could be hard to follow so here is a quick diagram of what is going on.

docker-ansible

The Jenkins Jobs in the Cartridge

The jobs do the following things:

As the name suggests, this job just demonstrates how to install Ansible on CentOS.  It installs Ansible in a Docker container in order to keep things simple and easy to clean up.  Having built a Docker image with Ansible installed, it tests the image simply by running the following inside the container:

$ ansible --version
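To make that first job a little more concrete, here is a minimal sketch of what "install Ansible in a Docker container and test it" could look like.  It is not the cartridge's actual code: the directory name, the image tag `ansible-sketch-install`, the `centos:7` base image and the EPEL install route are all my assumptions for illustration.  The script only prints the Docker commands unless you set `DRY_RUN=0`.

```shell
set -e

# Dry-run helper: prints commands by default; set DRY_RUN=0 to execute them
# (executing needs a working Docker daemon).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

mkdir -p ansible-sketch/install

# Hypothetical Dockerfile: CentOS 7 with Ansible installed from the EPEL repository.
cat > ansible-sketch/install/Dockerfile <<'EOF'
FROM centos:7
RUN yum install -y epel-release && \
    yum install -y ansible && \
    yum clean all
EOF

# Build the image, then test it by asking Ansible for its version inside a container.
run docker build -t ansible-sketch-install ansible-sketch/install
run docker run --rm ansible-sketch-install ansible --version
```

Keeping everything inside a throwaway image means cleaning up is just a matter of deleting the image afterwards.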

2_Run_Example_Adhoc_Commands

This job is a lot more interesting than the previous.  As the name suggests, the job is designed to run some adhoc Ansible commands (which is one of the first things you’ll do when learning Ansible).

Since the purpose of Ansible is infrastructure automation, we first need to set up an environment to run commands against.  My idea was to set up an environment of Docker containers pretending to be servers.  In real life I don’t think we would ever want Ansible configuring running Docker containers (we normally want Docker containers to be immutable and certainly don’t want them to have ssh access enabled).  However, I felt it was a quick way to get started and create something repeatable and disposable.

The environment created resembles the diagram above.  As you can see, we create two Docker containers (acting as servers) calling themselves web-node and one calling itself db-node.  The images already contain a public key (the same one vagrant uses actually) so that they can be ssh’d to (once again not good practice with Docker containers, but needed so that we can treat them like servers and use Ansible).  We then use an image which we refer to as the Ansible Control Container.  We create this image by installing Ansible and adding an Ansible hosts file that tells Ansible how to connect to the db and web “nodes” using the same key mentioned above.
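For anyone who hasn't seen an Ansible hosts (inventory) file before, the sketch below writes one that would fit the environment just described.  The group names `web` and `db` and the host names come from this post; the `vagrant` user name, the key path and the output directory are assumptions of mine, so check the cartridge code for the real values.

```shell
set -e

# Dry-run helper: prints commands by default; set DRY_RUN=0 to execute them
# (executing needs Ansible installed).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

mkdir -p ansible-sketch/control

# Hypothetical inventory: two web nodes, one db node, all reached over ssh
# with the same (assumed) user and private key.
cat > ansible-sketch/control/hosts <<'EOF'
[web]
web-node-1
web-node-2

[db]
db-node

[all:vars]
ansible_user=vagrant
ansible_ssh_private_key_file=/root/.ssh/insecure_private_key
EOF

# Ask Ansible to parse the inventory and show the groups it found.
run ansible-inventory -i ansible-sketch/control/hosts --list
```

With an inventory like this, `ansible web ...` targets both web nodes and `ansible db ...` targets the single db node.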

With the environment in place the job runs the following ad hoc Ansible commands:

  1. ping all web nodes using the Ansible ping module: ansible web -m ping
  2. gather facts about the db node using the Ansible setup module: ansible db -m setup
  3. add a user to all web servers using the Ansible user module: ansible web -b -m user -a 'name=johnd comment="John Doe" uid=1040'

By running the job and reading the console output you can see Ansible in action and then update the job to learn more.

3_Run_Your_Adhoc_Command

This job is identical to the job above in terms of setting up an environment to run Ansible.  However instead of having the hard-coded ad hoc Ansible commands listed above, it allows you to enter your own commands when running the job.  By default it pings all nodes:

ansible all -m ping

4_Run_A_Playbook

This job is identical to the job above in terms of setting up an environment to run Ansible.  However instead of passing in an ad hoc Ansible command, it lets you pass in an Ansible playbook to also run against the nodes.  By default the playbook that gets run installs Apache on the web nodes and PostgreSQL on the db node.  Of course you can change this to run any playbook you like so long as it is set to run on a host expression that matches: web-node-1, web-node-2, and/or db-node (or “all”).
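As a rough illustration of the kind of playbook this job runs by default, the sketch below writes one that installs Apache on the `web` group and PostgreSQL on the `db` group.  This is not the cartridge's own playbook: the file location is hypothetical, and the package names `httpd` and `postgresql-server` assume the pretend servers are CentOS-based.  The inventory path is taken from an `INVENTORY` variable that you would point at your own hosts file.

```shell
set -e

# Dry-run helper: prints commands by default; set DRY_RUN=0 to execute them
# (executing needs Ansible installed and an inventory file at $INVENTORY).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

INVENTORY="${INVENTORY:-hosts}"   # hypothetical default path
mkdir -p ansible-sketch/playbooks

# Hypothetical playbook: one play per group of pretend servers.
cat > ansible-sketch/playbooks/site.yml <<'EOF'
---
- name: Install Apache on the web nodes
  hosts: web
  become: true
  tasks:
    - name: Ensure httpd is installed
      package:
        name: httpd
        state: present

- name: Install PostgreSQL on the db node
  hosts: db
  become: true
  tasks:
    - name: Ensure postgresql-server is installed
      package:
        name: postgresql-server
        state: present
EOF

# Check the playbook parses before running it for real.
run ansible-playbook -i "$INVENTORY" ansible-sketch/playbooks/site.yml --syntax-check
run ansible-playbook -i "$INVENTORY" ansible-sketch/playbooks/site.yml
```

Changing the `hosts:` lines to `all`, or to a single node name, is an easy first experiment.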

How jobs 2-4 work

To understand exactly how jobs 2-4 work, the code is reasonably well commented and should be fairly readable.  However, at a high-level the following steps are run:

  1. Create the Ansible inventory (hosts) file that our Ansible Control Container will need so that it can connect (ssh) to our db and web “nodes” to control them.
  2. Build the Docker image for our Ansible Control Container (install Ansible like the first Jenkins job, and then add the inventory file)
  3. Create a Docker network for our pretend server containers and our Ansible Control container to all run on.
  4. Create a docker-compose file for our pretend servers environment
  5. Use docker-compose to create our pretend servers environment
  6. Run the Ansible Control Container, either mounting in the Jenkins workspace (if we want to run a local playbook file) or just running the ad hoc Ansible command.
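Steps 3 to 6 above can be sketched roughly as follows.  Treat this as an illustration rather than the cartridge's real code: the network name `ansible-sketch-net`, the image names `pretend-server` and `ansible-control`, and the `/workspace` mount point are all made up for the example.  As before, commands are only printed unless `DRY_RUN=0` is set.

```shell
set -e

# Dry-run helper: prints commands by default; set DRY_RUN=0 to execute them
# (executing needs Docker, docker-compose and the two hypothetical images).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

mkdir -p ansible-sketch/env

# Step 4: a compose file describing the pretend servers, all attached to a
# network that is created separately (hence "external").
cat > ansible-sketch/env/docker-compose.yml <<'EOF'
version: "2"
services:
  web-node-1:
    image: pretend-server
    container_name: web-node-1
    networks:
      - ansible-sketch-net
  web-node-2:
    image: pretend-server
    container_name: web-node-2
    networks:
      - ansible-sketch-net
  db-node:
    image: pretend-server
    container_name: db-node
    networks:
      - ansible-sketch-net
networks:
  ansible-sketch-net:
    external: true
EOF

# Step 3: the shared network for the pretend servers and the control container.
run docker network create ansible-sketch-net

# Step 5: start the pretend servers.
run docker-compose -f ansible-sketch/env/docker-compose.yml up -d

# Step 6: run the control container on the same network, mounting the current
# directory so a local playbook file would be visible inside it.
run docker run --rm --network ansible-sketch-net \
  -v "$PWD:/workspace" -w /workspace \
  ansible-control ansible all -m ping
```

Because the control container shares a network with the pretend servers, it can resolve them by the container names used in the inventory.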

Conclusion

I hope this has been a useful read and has clarified a few things about Ansible, ADOP and Docker.  If you find this useful please star the GitHub repo and/or share a pull request!

Bonus: here is an ADOP Platform Extension for Ansible Tower.

ADOP with Pivotal Cloud Foundry

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.

In this blog I will describe integrating ADOP and the Cloud Foundry public PaaS from Pivotal.  Whilst it is of course technically possible to run all of the tools found in ADOP on Cloud Foundry, that wasn’t our intention.  Instead we wanted to combine the Continuous Delivery pipeline capabilities of ADOP with the industrial grade cloud first environments that Cloud Foundry offers.

Many ADOP cartridges, for example the Java Petclinic one, contain two Continuous Delivery pipelines:

  • The first to build and test the infrastructure code and build the Platform Application
  • The second to build and test the application code and deploy it to an environment built on the Platform Application.

The beauty of using a Public PaaS like Pivotal Cloud Foundry is that your platforms and environments are taken care of leaving you much more time to focus on the application code.  However you do of course still need to create an account and provision your environments.

  1. Register here
  2. Click Pivotal Web Services
  3. Create a free tier account
  4. Create an organisation
  5. Create one or more spaces

With this in place you are ready to:

  1. Spin up an ADOP instance
  2. Store your Cloud Foundry credentials in Jenkins’ Secure Store
  3. Load the Cloud Foundry Cartridge (instructions)
  4. Trigger the Continuous Delivery pipeline.

Having done all of this, the pipeline now does the following:

  1. Builds the code (which happens to be the JPetStore)
  2. Runs the Unit Test and performs Static Code Analysis using SonarQube
  3. Deploys the code to an environment also known in Cloud Foundry as a Space
  4. Performs functional testing using Selenium and some security testing using OWASP ZAP.
  5. Performs some performance testing using Gatling.
  6. Kills the running application in the environment and waits to verify that Cloud Foundry automatically restores it.
  7. Deploys the application to a multi node Cloud Foundry environment.
  8. Kills one of the nodes in Cloud Foundry and validates that Cloud Foundry automatically avoids sending traffic to the killed node.
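To give a feel for the deployment steps (3 and 7), here is a minimal sketch using the `cf` command line client.  It is my own illustration, not the cartridge's code: the application name `jpetstore-sketch`, the manifest values, the artefact path and the `CF_*` variable names with their placeholder defaults are all assumptions.  I am also assuming a recent cf CLI, where `cf auth` reads `CF_USERNAME` and `CF_PASSWORD` from the environment, which keeps credentials out of the command line.

```shell
set -e

# Dry-run helper: prints commands by default; set DRY_RUN=0 to execute them
# (executing needs the cf CLI and a real Cloud Foundry account).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholders: in a pipeline these would come from Jenkins' credential store.
CF_API="${CF_API:-https://api.example.com}"
CF_ORG="${CF_ORG:-my-org}"
CF_SPACE="${CF_SPACE:-development}"

mkdir -p cf-sketch

# Hypothetical manifest describing a single-instance deployment.
cat > cf-sketch/manifest.yml <<'EOF'
---
applications:
  - name: jpetstore-sketch
    memory: 512M
    instances: 1
    path: target/jpetstore.war
EOF

# Point the CLI at the platform, authenticate and choose the space (environment).
run cf api "$CF_API"
run cf auth
run cf target -o "$CF_ORG" -s "$CF_SPACE"

# Step 3: deploy the application to the space.
run cf push -f cf-sketch/manifest.yml

# Step 7: scale out to more than one instance, then show the app's status.
run cf scale jpetstore-sketch -i 2
run cf app jpetstore-sketch
```

Watching the output of `cf app` after killing an instance is one simple way to see the platform restoring it.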

The beauty of ADOP is that all of this great Continuous Delivery automation is fully portable and can be loaded time and time again into any ADOP instance running on any cloud.

There is plenty more we could have done with the cartridge to really put the PaaS through its paces such as generating load and watching auto-scaling in action.  Everything is on Github, so pull requests will be warmly welcomed!  If you’ve tried to follow along but got stuck at all, please comment on this blog.

Abstraction is not Obsoletion – Abstraction is Survival

Successfully delivering Enterprise IT is a complicated, probably even complex problem.  What’s surprising is that, as an industry, many of us are still comfortable accepting so much of the problem as our own to manage.

Let’s consider an albeit very simplified and arguably imprecise view of The “full stack”:

  • Physical electrical characteristics of materials (e.g. copper / p-type silicon, …)
  • Electronic components (resistor, capacitor, transistor)
  • Integrated circuits
  • CPUs and storage
  • Hardware devices
  • Operating Systems
  • Assembly Language
  • Modern Software Languages
  • Middleware Software
  • Business Software Systems
  • Business Logic

When you examine this view, hopefully (irrespective of what you think about what’s included or missing and the order) it is clear that when we do “IT” we are already extremely comfortable being abstracted from detail. We are already fully ready to use things which we do not and may never understand. When we build an eCommerce Platform, an ERP, or CRM system, little thought is given to Electronic components for example.

My challenge to the industry as a whole is to recognise more openly the immense benefit of abstraction on which we are already entirely dependent and to embrace it even more urgently!

Here is my thinking:

  • Electrons are hard – so we take them for granted
  • Integrated circuits are hard – so we take them for granted
  • Hardware devices (servers for example) are hard – so why are so many enterprises still buying and managing them?
  • The software that it takes to make servers useful for hosting an application is hard – so why are we still doing this by default?

For solutions that still involve writing code, the most extreme example of abstraction I’ve experienced so far is the Lambda service from AWS.  Some seem to have started calling such things ServerLess computing.

With Lambda you write your software functions and upload them ready for AWS to run for you. Then you configure the triggering event that would cause your function to run. Then you sit back and pay for the privilege whilst enjoying the benefits. Obviously if the benefits outweigh the cost for the service you are making money. (Or perhaps in the world of venture capital, if the benefits are generating lots of revenue or even just active users growth, for now you don’t care…)

Let’s take a mobile example. Anyone with enough time and dedication can sit at home on a laptop and start writing mobile applications. If they write a purely standalone, offline application and charge a small fee for it, theoretically they can make enough money to retire on without even knowing how to spell server.  In practice, most applications (even if they just rely on in-app adverts) require network-enabled services. But for this our app developer still doesn’t need to spell server: they just need to use the API of an online ad company, e.g. Adwords, and their app will start generating advertising revenue. Next, perhaps the application relies on persisting data off the device or on notifications being pushed to it. The developer still only needs to use another API to do this; for example, Parse can provide all of that as a service.  You just use the software development kit and are completely abstracted from servers.

So why are so many enterprises still exposing themselves to so much of the “full stack” above?  I wonder how much inertia there was to integrated circuits in the 1950s and how many people argued against abstraction from transistors…

To survive is to embrace Abstraction!

 


[1] Abstraction in a general computer science sense not a mathematical one (as used by Joel Spolsky in his excellent Law of Leaky Abstractions blog.)

Join the DevOps Community Today!

As I’ve said in the past, if your organisation does not yet consider itself to be “doing DevOps” you should change that today.

If I was pushed to say the one thing I love most about the DevOps movement, it would be the sense of community and sharing.

I’ve never experienced anything like it previously in our industry.  It seems like everyone involved is united by being passionate about collaborating in as many ways as possible to improve:

  • the world through software
  • the rate at which we can do that
  • the lives of those working in our industry.

The barrier to entry to this community is extremely low, for example you can:

You could also consider attending the DevOps Enterprise Summit London (DOES).  It’s the third DOES event and the first ever in Europe and is highly likely to be one of the most important professional development things you do this year.  Organised by Gene Kim (co-author of The Phoenix Project) and IT Revolution, the conference is highly focused on bringing together anyone interested in DevOps and providing them as much support as humanly possible in two days.  This involves presentations from some of the most advanced IT organisations in the world (aka unicorns), as well as many from those in traditional enterprises who may be on a very similar journey to you.   Already confirmed are talks from:

  • Rosalind Radcliffe talking about doing DevOps with Mainframe systems
  • Ron Van Kemenade CIO of ING Bank
  • Jason Cox about doing DevOps transformation at Disney
  • Scott Potter Head of New Engineering at News UK
  • And many more.

My recommendation is to get as many people from your organisation along to the event as possible.  They won’t be disappointed.

Early bird tickets are available until 11th May 2016.

(Full disclosure – I’m a volunteer on the DOES London committee.)
