Minimum Viable Operations

Several years ago I started using the name Minimum Viable Operations to describe a concept I’d been tinkering with that draws on Agile, DevOps, flow, service management and operational risk.  I define it as:

The minimum set of processes, controls and security required to expose a new system or service to a target consumer whilst achieving the desired levels of availability and risk.

These days I see Site Reliability Engineering (SRE) rapidly emerging as a set of processes that help achieve this.  Whilst chatting with a friend about it earlier today I realised I’d never published my blog on this, so here it is.

When you need Minimum Viable Operations

These days lots of organisations have successfully sped up the process of innovating and developing new applications using modern lightweight architectures and cloud.  They are probably using an agile development methodology and hopefully some good Continuous Delivery, perhaps even using something like the Accenture DevOps Platform (ADOP).  They may have achieved this by innovating on the edge, i.e. creating an autonomous sub-organisation free from the trappings and controls of the mainstream IT department.  I’ve worked with quite a few organisations who have done this in separate, possibly more “trendy” offices known as Digital Hubs, Innovation Labs, Studios, etc.

But when the sprints are over, the show and tell is complete and the post-it notes start to drop off the wall… was it all worth it?

If hundreds or maybe even thousands of person-hours rich with design thinking, new technologies, and automation culminate in an internal demo that doesn’t reach real end users, you haven’t become agile.  If you’ve accidentally diluted the term Minimum Viable Product (MVP) to mean Release 1, and that took six months to complete, you haven’t become agile.

Whilst prototypes do serve a purpose,

the highest value to an entrepreneurial organisation comes from proving or disproving whether target users are willing to use and keep using a product or service, and whether there are enough users to make it commercially viable.

This is what Eric Ries, in The Lean Startup, calls the Value Hypothesis and the Growth Hypothesis.  Testing these requires putting something “live” in front of real users ASAP, and to do that successfully I think you need Minimum Viable Operations (oh why not, let’s give it a TLA: MVO).

Alternative Options to MVO

Modern standalone agile development organisations have four options:

  1. Don’t bother with any operations and just keep building prototypes and dazzling demos.  The downside is that at some point someone may realise you don’t make money and shut you down.
  2. Don’t bother with MVO but try to get your prototype turned into live systems via a project in the traditional IT department.  The downside is that time to market will be huge and the rate of change of your idea will hit a wall.  CR anyone?
  3. Don’t bother with MVO and just put prototypes live.  The downside here is potentially unbounded risk of reputation and damages through outages and security disasters.
  4. Implement MVO, get things in front of customers, get into a nice, tight, lean OODA loop or PDCA cycle (perhaps even using the Cynefin framework), and overall create a learning environment capable of exponential growth.  The downside here is that it might feel a lot less safe than prototyping and you will have to learn to celebrate failure.

Obviously option 4 is my favourite.

Implementing MVO

Aside: for simplicity’s sake, I’m conveniently ignoring the vital topic of organisational governance and politics.  Clearly navigating this is of vital importance, but beyond what I want to get into here.

Implementing Minimum Viable Operations is essentially a risk management exercise with a lot in common with standard Operations Architecture (and nowadays SRE).  Everything just needs to be done much, much faster.  This means thinking hard, and upfront, about the “ilities”: reliability, availability, scalability, security and so on.  It also, firstly, requires you to be much more realistic about each of them:

  • Reliability – are three nines really needed in order to get your Minimum Viable Product in front of customers and measure their reactions (see the rough arithmetic after this list)?  Possibly you don’t actually need to create a perfect user experience for everyone all the time, and can evaluate your hypotheses based on the usage that you are able to achieve.  Obviously putting out a product with a level of service quality unbefitting of your brand presents something of a reputational risk.  But do you actually need to apply your brand to the product?  Or perhaps you can incentivise users to put up with unreliability via other rewards, for example by not charging.
  • Availability – we live in a 24×7, always-on world, but at the start do we need our MVP to be always online?  Perhaps if we strictly limit use by geography it becomes painless to have downtime windows.  Perhaps, as an extension of the reliability argument above, taking the system down for maintenance, re-design, or even to discontinue it, is acceptable.
  • Scalability – perhaps if our service gets so popular that we cannot keep it live, that isn’t such a bad thing; in fact it could be exactly the problem that we want to have.  It’s like the old “premature optimisation is the root of all evil” argument.  There are lots of ways to scale things when needed in the cloud.  Perhaps perfecting this isn’t a prerequisite to going live.
  • Security – obviously this is a vital consideration.  But again it must be strictly “right-sized”.  If a system doesn’t contain personal data, doesn’t take payment information, doesn’t store data, and you aren’t overly concerned about availability, perhaps this doesn’t have to be quite the barrier to going live that you thought it might be.  Plus if you are practicing good DevSecOps in your development process, and using Nexus Lifecycle, you are off to a good start.
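Since we are talking about right-sizing, it is worth doing the simple arithmetic on what a target like “three nines” (99.9% availability) actually allows.  A rough sketch, assuming a 30-day month:

$ echo "30*24*60*0.001" | bc    # minutes of downtime per month at 99.9% ≈ 43.2
$ echo "365*24*60*0.001" | bc   # minutes per year ≈ 525.6 (about 8.8 hours)

If roughly 45 minutes of downtime a month wouldn’t invalidate your experiment, you may not need three nines at all – and every nine you drop removes processes and controls from your minimum.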

Secondly, unlike a traditional Operations Architecture scenario where SLAs, RPOs, etc. may have been rather arbitrarily defined (perhaps copied and pasted into a contract), when creating MVO for an MVP you may actually have the opportunity to change the MVP design to make it easier to put live.  Does the MVP actually need to: be branded with the company name, take payment, require registration, etc.?  If none of that impacts your hypotheses, consider dropping them and going live sooner.

Finally, as a long-time fan of PaaS, I should also mention that organisations can make releasing MVPs much easier if they have invested in a re-usable hosting platform (which of course may itself benefit from MVO).

What can someone do with this?

I’ve seen value in organisations creating a relatively light methodology for doing something similar to MVO.  Nowadays they might (perfectly reasonably) also call it implementing SRE.  It could include:

  1. Guidelines for making MVPs smaller and easier to put live and operate.
  2. Risk assessment tools to help right-size Operations.
  3. Risk profiles for different types of application.
  4. Peer review checklists to validate someone else’s minimum viable operations architecture.
  5. Opinionated, reusable platform (PaaS) services that natively provide certain safety as well as freedom.
  6. Re-usable security services.
  7. Re-usable monitoring solutions.
  8. Training around embracing learning from failure and creating good customer experiments.
  9. A general increase in focus on and awareness of how quickly we are actually putting things live for clients.
  10. When and how to transition from MVO to a full-blown, SRE-controlled service.

Final Comment

DevOps has, alas, in many cases become conflated with implementing Continuous Delivery into testing environments.  Sometimes the last mile into Production is still too long.  Call it MVO (probably don’t!), call it SRE (why not?), still call it DevOps, Agile, or Lean (why not?)… I recommend giving some thought to MVO.

Transforming the future together

We have chosen “Transforming the future together” as the theme for the DevOpsDays London 2017 Conference.  I wanted to share here some personal thoughts about what it means.  These are my own views and may not be shared by the other organisers (including Bob Walker, who proposed the theme).  If you are reading this before 31st May 2017 there is still time for you to submit a proposal to talk!

One of the things I find most inspiring about the DevOps community is the level of sharing, not just in terms of Open Source software, but in terms of strong views on culture, processes, metrics and usage of technology.  I am especially excited by the theme because of the potential for expanding what we share.

If we are going to transform the future I think to an extent we have to hold an interpretation of:

  • where we are today,
  • where the future might be heading without our intervention,
  • how we would like it to be, and
  • what we might do to influence it.

Even standing alone I can imagine each of these having lots of potential.

The great thing about hearing about the experiences of others is that it can be a shortcut to discovering things without having to try them yourself.  For example, you are walking in the country and you pass someone covered in mud.  “The middle of this track is much more slippery than it looks,” they volunteer with a pained smile.  Whether you choose to take the side of the path or test your own ability to brave the slippery part, you are much better off for the experience.

I hope to see some talks at DevOpsDays London that attempt to describe, with balance, where they are today as technology organisations.  Not just how many times a day they can change static wording on their website, but also how fast they can change, for example, their packaged Oracle products, and what their bottlenecks are.

“The future is already here — it’s just not very evenly distributed.”  William Gibson

Naturally everyone is at a different stage of their DevOps journey.  I’m hoping there will be a good range of talks reflecting this variety.

I would love to hear some predictions, optimistic and pessimistic, about where we might be and where we should be heading.  How will the world change when we cross 50% of people being online?  Are robots really coming for our jobs or our lives?  Is blockchain really going to find widespread usage?  Will the private data centre be marginalised as much as private electricity generators have been?  Will Platform-as-a-Service finally gain widespread traction, or will it be eclipsed by ServerLess?  Will 100% remote working liberate more of us from living near expensive cities?  Will the debate about what is and isn’t Agile ever end?  What is coming after Micro Services?  Will the gender balance in our industry become more equal?  Can holacracies succeed as a form of organisation?  These questions are of course just the start.

Finally, what are people doing now towards achieving some of the above?  What technologies are really starting to change the game?  What new ways of working and techniques are really making an impact (positive or negative)?  How are they working to be more inclusive and improve diversity?  Are they actually liking bi-modal IT?  What are they doing to alleviate their bottlenecks?

Please share your own questions that you would like to see answered and once again please consider submitting a proposal.

Image: https://en.wikipedia.org/wiki/File:ArcaBoard_Dumitru_Popescu.jpg

How Psychologically Safe does your team feel?

As per this article, Google conducted a two-year study into what makes their best teams great and found psychological safety to be the most important factor.

As per Wikipedia, psychological safety can be defined as:

“feeling able to show and employ one’s self without fear of negative consequences of self-image, status or career”

It certainly seems logical to me that creating a safe working environment, where people are free to share their individual opinions and avoid groupthink, is highly important.

So the key question is how can you foster psychological safety?

Some of the best advice I’ve read was from this blog by Steven M Smith.  He suggests performing a paper-based voting exercise to measure safety.

Whilst we’ve found this to be a good technique, the act of screwing up bits of paper is tedious and hard to do remotely.  Hence we’ve created an online tool:

https://safetychecker.herokuapp.com/

Please give it a go and share your experiences!

How my team do root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems, for example from Etsy and the Beyond Blame book.  I hope you find it useful.

RCA Approach

 

This page describes a 7 step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice there is often a cocktail of contributing causes, as well as negative (and sometimes positive) outcomes.
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions – not just to solve a mystery or, worse, find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. did they consciously carry out actions that they knew would cause this (or something of similarly negative consequence), or did they clearly understand the risks but care so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer to both questions is “NO”.  If it is “NO”, we can now proceed in a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answer is “YES”, this is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated, we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could take to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need to go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could take to validate them, and prioritise the discovery.
  3. If anyone feels uncomfortable during the process, they must raise it immediately, whether that discomfort is due to:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability to have their voice heard
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.

Step 4: Agree a statement to describe the incident that warranted this RCA

Using open discussion, attempt to reach a consensus on a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people, e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and identify and underline every aspect of the statement that someone could ask “Why” about.  Try to take an outsider’s view: even if you know the answer, or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question, attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into more specific “Why” questions so that each answer correlates directly with one question.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer more than 5 “Why”s per branch, possibly you are taking steps that are too large.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement: where necessary, we should have implemented measures to protect ourselves from the third party.
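To make this step concrete, here is a purely invented fragment of one branch (the incident and answers are illustrative, not from a real RCA):

Why was the order service unavailable?  – The application server ran out of disk space.
Why did it run out of disk space?  – Log files are never rotated.
Why are log files never rotated?  – The provisioning scripts don’t configure log rotation.
Why don’t the scripts configure it?  – We have no checklist of what provisioning must cover.
Why is there no checklist?  – We have never captured our operational requirements in a reusable form.

Note that the branch ends on a process gap we can fix, not on a person who “should have known better”.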

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and make hypotheses about actions that could be taken to prevent future re-occurrences.  Like all good hypotheses, these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria, and don’t close the actions unless they satisfy tests showing that they have been effective.

 

Using ADOP and Docker to Learn Ansible

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.  Through the concept of cartridges (plugins) ADOP also makes it very easy to re-use automation.

In this blog I will describe an ADOP Cartridge that I created as an easy way to experiment with Ansible.  Of course there are many other ways of experimenting with Ansible such as using Vagrant.  I chose to create an ADOP cartridge because ADOP is so easy to provision and predictable.  If you have an ADOP instance running you will be able to experience Ansible doing various interesting things in under 15 minutes.

To try this for yourself:

  1. Spin up an ADOP instance
  2. Load the Ansible 101 Cartridge (instructions)
  3. Run the jobs one-by-one and in each case read the console output.
  4. Re-run the jobs with different input parameters.

For anyone only loosely familiar with ADOP, Docker and Ansible, I recognise that this blog could be hard to follow, so here is a quick diagram of what is going on.

[Diagram: Jenkins jobs build an Ansible Control Container which connects over a Docker network to the “web” and “db” node containers]

The Jenkins Jobs in the Cartridge

The jobs do the following things:

The first job, as its name suggests, just demonstrates how to install Ansible on CentOS.  It installs Ansible in a Docker container in order to keep things simple and easy to clean up.  Having built a Docker image with Ansible installed, it tests the image simply by running the following inside the container:

$ ansible --version
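For the curious, the essence of that first job is small enough to reproduce by hand.  The following is a sketch of the idea rather than the cartridge’s actual code (the image tag is my own invention):

$ cat > Dockerfile <<'EOF'
FROM centos:7
# EPEL provides the ansible package on CentOS
RUN yum install -y epel-release && yum install -y ansible
EOF
$ docker build -t ansible-control .
$ docker run --rm ansible-control ansible --version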

2_Run_Example_Adhoc_Commands

This job is a lot more interesting than the previous one.  As the name suggests, it is designed to run some ad hoc Ansible commands (which is one of the first things you’ll do when learning Ansible).

Since the purpose of Ansible is infrastructure automation, we first need to set up an environment to run commands against.  My idea was to set up an environment of Docker containers pretending to be servers.  In real life I don’t think we would ever want Ansible configuring running Docker containers (we normally want Docker containers to be immutable and certainly don’t want them to have ssh access enabled).  However, I felt it was a quick way to get started and to create something repeatable and disposable.

The environment created resembles the diagram above.  As you can see, we create two Docker containers (acting as servers) calling themselves web-node-1 and web-node-2, and one calling itself db-node.  The images already contain a public key (the same one Vagrant uses, actually) so that they can be ssh’d to (once again, not good practice with Docker containers, but needed so that we can treat them like servers and use Ansible).  We then use an image which we refer to as the Ansible Control Container.  We create this image by installing Ansible and adding an Ansible hosts (inventory) file that tells Ansible how to connect to the db and web “nodes” using the same key mentioned above.
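To make that more tangible, the inventory file would look something like this (a hand-written sketch; the cartridge’s actual file, connection user and key path may differ):

$ cat > hosts <<'EOF'
[web]
web-node-1
web-node-2

[db]
db-node

[all:vars]
ansible_user=root
ansible_ssh_private_key_file=/root/.ssh/id_rsa
EOF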

With the environment in place the job runs the following ad hoc Ansible commands:

  1. ping all web nodes using the Ansible ping module: ansible web -m ping
  2. gather facts about the db node using the Ansible setup module: ansible db -m setup
  3. add a user to all web servers using the Ansible user module: ansible web -b -m user -a 'name=johnd comment="John Doe" uid=1040'

By running the job and reading the console output you can see Ansible in action and then update the job to learn more.

3_Run_Your_Adhoc_Command

This job is identical to the job above in terms of setting up an environment to run Ansible.  However, instead of having the hard-coded ad hoc Ansible commands listed above, it allows you to enter your own commands when running the job.  By default it pings all nodes:

ansible all -m ping

4_Run_A_Playbook

This job is identical to the jobs above in terms of setting up an environment to run Ansible.  However, instead of passing in an ad hoc Ansible command, it lets you pass in an Ansible playbook to run against the nodes.  By default, the playbook that gets run installs Apache on the web nodes and PostgreSQL on the db node.  Of course you can change this to run any playbook you like, so long as it is set to run on a host expression that matches: web-node-1, web-node-2, and/or db-node (or “all”).
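If you want to supply your own playbook to this job, something of this shape will run (a minimal sketch; the package names assume CentOS-based node images like those described above):

$ cat > my-playbook.yml <<'EOF'
- hosts: web
  become: yes
  tasks:
    - name: Install Apache
      yum:
        name: httpd
        state: present

- hosts: db
  become: yes
  tasks:
    - name: Install PostgreSQL
      yum:
        name: postgresql-server
        state: present
EOF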

How the jobs 2-4 work

To understand exactly how jobs 2-4 work, the code is reasonably well commented and should be fairly readable.  However, at a high level, the following steps are run:

  1. Create the Ansible inventory (hosts) file that our Ansible Control Container will need so that it can connect (ssh) to our db and web “nodes” to control them.
  2. Build the Docker image for our Ansible Control Container (install Ansible like the first Jenkins job, and then add the inventory file)
  3. Create a Docker network for our pretend server containers and our Ansible Control container to all run on.
  4. Create a docker-compose file for our pretend servers environment
  5. Use docker-compose to create our pretend servers environment
  6. Run the Ansible Control Container, mounting in the Jenkins workspace if we want to run a local playbook file, or otherwise just running the ad hoc Ansible command.
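Stitched together outside of Jenkins, those steps boil down to something like this (a sketch with invented names; the real job parameterises all of them):

$ docker network create ansible-demo
$ docker build -t ansible-control .    # the image from step 2, inventory baked in
$ docker-compose up -d                 # starts web-node-1, web-node-2 and db-node
$ docker run --rm --net ansible-demo -v "$PWD":/work ansible-control \
    ansible all -m ping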

Conclusion

I hope this has been a useful read and has clarified a few things about Ansible, ADOP and Docker.  If you found it useful, please star the GitHub repo and/or share a pull request!

Bonus: here is an ADOP Platform Extension for Ansible Tower.

ADOP with Pivotal Cloud Foundry

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.

In this blog I will describe integrating ADOP with the Cloud Foundry public PaaS from Pivotal.  Whilst it is of course technically possible to run all of the tools found in ADOP on Cloud Foundry, that wasn’t our intention.  Instead we wanted to combine the Continuous Delivery pipeline capabilities of ADOP with the industrial-grade, cloud-first environments that Cloud Foundry offers.

Many ADOP cartridges, for example the Java Petclinic one, contain two Continuous Delivery pipelines:

  • The first to build and test the infrastructure code and build the Platform Application
  • The second to build and test the application code and deploy it to an environment built on the Platform Application.

The beauty of using a public PaaS like Pivotal Cloud Foundry is that your platforms and environments are taken care of, leaving you much more time to focus on the application code.  However, you do of course still need to create an account and provision your environments:

  1. Register here
  2. Click Pivotal Web Services
  3. Create a free tier account
  4. Create an organisation
  5. Create one or more spaces
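If you prefer the command line, the last steps can also be done with the cf CLI once your account exists (a sketch; the org and space names here are placeholders):

$ cf login -a https://api.run.pivotal.io -u you@example.com
$ cf create-space development -o my-org
$ cf target -o my-org -s development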

With this in place you are ready to:

  1. Spin up an ADOP instance
  2. Store your Cloud Foundry credentials in Jenkins’ Secure Store
  3. Load the Cloud Foundry Cartridge (instructions)
  4. Trigger the Continuous Delivery pipeline.

Having done all of this, the pipeline now does the following:

  1. Builds the code (which happens to be the JPetStore application)
  2. Runs the unit tests and performs static code analysis using SonarQube
  3. Deploys the code to an environment (also known in Cloud Foundry as a Space)
  4. Performs functional testing using Selenium and some security testing using OWASP ZAP.
  5. Performs some performance testing using Gatling.
  6. Kills the running application in the environment and waits to verify that Cloud Foundry automatically restores it.
  7. Deploys the application to a multi node Cloud Foundry environment.
  8. Kills one of the nodes in Cloud Foundry and validates that Cloud Foundry automatically avoids sending traffic to the killed node.

The beauty of ADOP is that all of this great Continuous Delivery automation is fully portable and can be loaded time and time again into any ADOP instance running on any cloud.

There is plenty more we could have done with the cartridge to really put the PaaS through its paces, such as generating load and watching auto-scaling in action.  Everything is on GitHub, so pull requests will be warmly welcomed!  If you’ve tried to follow along but got stuck at all, please comment on this blog.

Abstraction is not Obsoletion – Abstraction is Survival

Successfully delivering Enterprise IT is a complicated, probably even complex, problem.  What’s surprising is that, as an industry, many of us are still comfortable accepting so much of the problem as our own to manage.

Let’s consider an (albeit very simplified and arguably imprecise) view of the “full stack”:

  • Physical electrical characteristics of materials (e.g. copper / p-type silicon, …)
  • Electronic components (resistor, capacitor, transistor)
  • Integrated circuits
  • CPUs and storage
  • Hardware devices
  • Operating Systems
  • Assembly Language
  • Modern Software Languages
  • Middleware Software
  • Business Software Systems
  • Business Logic

When you examine this view, hopefully (irrespective of what you think about what’s included or missing, and the order) it is clear that when we do “IT” we are already extremely comfortable being abstracted from detail.  We are already fully ready to use things which we do not, and may never, understand.  When we build an eCommerce platform, an ERP, or a CRM system, little thought is given to electronic components, for example.

My challenge to the industry as a whole is to recognise more openly the immense benefit of abstraction, on which we are already entirely dependent, and to embrace it even more urgently!

Here is my thinking:

  • Electrons are hard – we take them for granted
  • Integrated circuits are hard – so we take them for granted
  • Hardware devices (servers for example) are hard – so why are so many enterprises still buying and managing them?
  • The software that it takes to make servers useful for hosting an application is hard – so why are we still doing this by default?

For solutions that still involve writing code, the most extreme example of abstraction I’ve experienced so far is the Lambda service from AWS.  Some seem to have started calling such things ServerLess computing.

With Lambda you write your software functions and upload them ready for AWS to run for you.  Then you configure the triggering event that would cause your function to run.  Then you sit back and pay for the privilege whilst enjoying the benefits.  Obviously if the benefits outweigh the cost of the service, you are making money.  (Or perhaps in the world of venture capital, if the benefits are generating lots of revenue or even just active user growth, for now you don’t care…)
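To give a flavour of quite how little of the stack you own, deploying a function is essentially this (a sketch; the function name, runtime, role ARN and handler are all placeholders):

$ zip function.zip handler.py
$ aws lambda create-function --function-name hello-world \
    --runtime python2.7 --handler handler.lambda_handler \
    --role arn:aws:iam::123456789012:role/lambda-exec \
    --zip-file fileb://function.zip

No servers to provision, patch, or capacity-plan – you pay per invocation.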

Let’s take a mobile example.  Anyone with enough time and dedication can sit at home on a laptop and start writing mobile applications.  If they write one as a purely standalone, offline application, and charge a small fee for it, theoretically they can make enough money to retire on without even knowing how to spell “server”.  But in practice most applications (even if they just rely on in-app adverts) require network-enabled services.  For this, our app developer still doesn’t need to spell “server”; they just need to use the API of an online ad company, e.g. AdWords, and their app will start generating advertising revenue.  Next, perhaps the application relies on persisting data off the device, or on notifications being pushed to it.  The developer still only needs to use another API to do this; for example, Parse can provide all of that as a programming service.  You just use the software development kit and are completely abstracted from servers.

So why are so many enterprises still exposing themselves to so much of the “full stack” above?  I wonder how much inertia there was around integrated circuits in the 1950s, and how many people argued against abstraction from transistors…

To survive is to embrace Abstraction!

 


[1] Abstraction in a general computer science sense, not a mathematical one (as used by Joel Spolsky in his excellent Law of Leaky Abstractions blog).