Team Topologies Book Summary – Part 2 of 3: Topologies and Interaction Modes

In part 1 of this three-part blog series about the Team Topologies book, I summarised a large set of general practices that can help make teams successful.  In this part I will get to actual team design, i.e. types of team and how they should interact.

The Four Team Types

Based on their extensive research, the authors present 4 types of team and make an important assertion: to be an effective organisation you only need a mixture of teams conforming to these 4 types.

Personally, I really identified with the 4 types, but even if you have different opinions, it is still very useful to be able to reference and extend / update the common set of types described (in such great detail) in the book.

The 4 types are as follows.

Stream Aligned Teams

This is the name the authors coined for an end-to-end team that owns the full value stream for delivering changes to software products or services AND operating them.  This is the stereotypical modern and popular “you build it, you run it” team.

For Stream Aligned teams to be able to do their job, they need to be composed of team members who collectively have all of the different skills they need e.g. business knowledge, architecture, development and testing skills, UX, security and compliance knowledge, monitoring and metrics skills etc.  This enables them to meet the requirement of minimising queues and handoffs to other teams (especially in comparison to teams of people performing a single function e.g. testing).

The expectation is that by owning their whole value stream, including the performance of the system in production, teams can optimise for rapid, small-batch changes and reap all of the expected benefits around both agility and safety.  The hope is also that this end-to-end scope might help teams achieve “autonomy, mastery and purpose” (the things which Daniel Pink highlights as most important to knowledge workers).

The amount of cognitive load required of them is a function of a few things:

  • If they have enough of the requisite skills in their team, intrinsic cognitive load will be manageable.
  • If the modes of interaction with other teams are clean and efficient, their extraneous cognitive load shouldn’t be too high.  They can also minimise this within their team through automation.
  • If they were formed using fracture planes effectively, they can keep their scope (and therefore their germane cognitive load) to a manageable amount.

Overall this seems like a very compelling team type and indeed the book recommends that the majority of teams in organisations have this form.

The book goes into a lot more detail about how to make a Stream Aligned Team effective, which I highly recommend reading.

If the book had only included Stream Aligned Teams, I would have considered it a cynical attempt to document fashionable ideas and ignore the factors that have led many organisations to have other team types.  Fortunately, they instead did a great job of considering the whole picture via 3 other types (plus bonus type SRE – read on!).

Platform Teams

In a tech start-up that begins with just one Stream Aligned Team owning the full stack and software lifecycle end-to-end, at some point they will run into Dunbar’s Number and cognitive overload.  It will be time to look for a fracture plane to enable splitting into at least two teams.  Separating the platform from the application is a very successful and well-established pattern.  At this point we could potentially have one Stream Aligned Team working on the application and one on the platform.

Let’s say the business and demand for application complexity continue to increase, and the application team splits into two teams focussed on different application value streams.  We’re now in a situation where both teams most likely consume the same platform, and we get the benefit of re-using it.  This could of course repeat many times.  The authors recognised that the nature of running a platform team (especially in terms of the type of coupling to other teams and the effective modes of communication) differed enough to warrant a new team type, which they called the Platform Team.

Platform Teams create and operate something re-usable that meets one or more application hosting requirements.  These could be teams running platform applications like Kubernetes, a Content Management System as a service, or IaaS, or even teams wrapping third-party as-a-service offerings like a Relational Database Service.  They may also layer on additional features valuable to the consumers of the platform, such as security and compliance scanning.

But there are potential traps with platform teams.  If the platform is too opinionated and, worse, its usage is also mandated within an organisation, the impedance mismatch may do more harm than good for consuming applications.  If a platform isn’t observable and applications are deployed into a black box, the application teams will be disconnected from the detail of how their application is performing in production and disempowered to make a difference to it.  The book goes into a lot of detail about how to avoid creating bad platform teams and strongly makes an argument for keeping platforms as thin as possible (what it calls a minimum viable platform).

Personally, I think the dynamic that occurs when consuming a platform from a third party is very powerful for a number of reasons:

  • The platform provider is under commercial pressure to create a platform good enough that consumers pay to use it.
  • The platform provider has to be able to deliver the service at a cost point below the total revenue its users will give it (so it will be efficient and choose its offered services carefully).

If organisations can at least try to think like that (with or without internal use of “wooden dollars”), I think they will create an effective dynamic.  The book mentions that Don Reinertsen recommends internal pricing as a mechanism for avoiding platform consumers demanding things they don’t need.

Enabling Teams

The next type of team is designed to serve other teams to help them improve.  I think it is great that the book acknowledges and explores these as they are very common and often a good thing.  Enabling teams should contain experienced people excited about finding ways to help other teams improve.  To some extent they can be thought of as technical consultants with a remit for helping drive improvement.

Most important is how well Enabling teams engage with other teams.  There are various traps such a team can fall into and must avoid:

  • Becoming an ivory tower defining processes, policies, perhaps even technical ‘decisions’ and inflicting them upon the teams that they are supposed to be helping.
  • Generally being disruptive by causing things like interruptions, context switching, cognitive load, and communication overhead – especially if this cost outweighs the benefit.

As with all other types, the book provides detailed expected behaviours and success criteria for Enabling teams, such as understanding the needs of the teams they support and taking an experimental approach to meeting those needs.  It’s also important that Enabling teams bring in a wider perspective, for example ideas about technology advances and industry technology and process trends.  Enabling teams may be long lived, but their engagement with the teams they are supporting should probably not be permanent.

Complicated-Subsystem Teams

Finally, the book defines a 4th type of team called the Complicated-Subsystem Team.  Essentially these teams own the development and maintenance of a complicated subcomponent that is probably consumed as code or a binary, rather than at runtime as a service over a network.  The concept is that the component requires specialist knowledge to build and change, and that this can be done most effectively by a dedicated team which is spared the cognitive load of consuming and deploying the component.

Other types of teams: SRE teams

I did feel the book acknowledged some other team types outside of the main 4, for example, SRE teams.

Without getting into too much detail, Site Reliability Engineering (SRE) is an approach to operations that Google developed and started sharing with the world a couple of years ago.  The thing that is most interesting about it from a team design perspective is that an SRE team is a traditional, separate operations team.  In this regard it doesn’t really fit the 4 topologies above.  I think there are a couple of reasons the SRE model is successful:

  1. At least at Google, the default mode of operation is Stream Aligned teams.  An SRE team will only operate an application if it is demonstrably reliable and doesn’t require a lot of manual effort to operate.
  2. It promotes use of some conceptually simple but effective metrics to ensure the application meets the standards for an SRE to operate it.

The book says that if your systems are reliable and operable enough, you can use SRE teams.  Doing so will of course reduce some of the cognitive load on the team, which has to pay less attention to the demands of the application in production.  This is actually very interesting, because in some ways a Stream Aligned team handing over applications to an SRE team to operate is very similar to a traditional Dev and Ops split.

So, the message in a way becomes: if your processes are mature enough and your engineering good enough, traditional ways will suffice, BUT the problem is they probably aren’t, and instead you need to use Stream Aligned teams until you get there.  I think this leads to an alternative option for organisations – directly implement the measurements of SRE and focus on quality until you can make your existing separate Dev and Ops teams perform well enough.  I even suggest this justifies a first-class fifth team type.

Other types of teams: Service Experience

The book also acknowledges that many business models include call centre teams and service desk teams who engage with customers directly.  It proposes keeping these teams separate from the team types above, whilst ensuring that information about the operation of the live system makes it back to the Stream Aligned teams.  It also talks about grouping and tightly aligning Service Experience Teams with Stream Aligned teams.  The name Service Experience Team is chosen to emphasise their customer orientation.

A key success criterion is the relationship to Stream Aligned teams, which must be close and ideally 1 to 1.  I think this is an excellent point: in many organisations, 1-to-many (overly shared) relationships between teams that should be closely aligned cause as many problems as the team boundaries themselves.  For example, if an application support team is spread across too many applications, it may achieve some efficiencies of scale, but the overall service delivered to each supported team will be far less effective than if the team were subdivided into smaller, more closely aligned teams.

If you spotted other team types in the book that were acknowledged as acceptable – let me know.

The Three Modes of Team Interaction

The book defines 3 interaction modes and recommends which team types should use each one.  By interactions the book means how and when teams communicate and collaborate with each other.  Central to this is the idea that communication is expensive and erodes team structures, so it should be used wisely.

The first interaction mode the book defines is called Collaboration.  This is a multi-directional, regular, and close interaction.  It is fast, often even real time, and therefore responsive and agile.  It is the mode that enables teams to work most closely together and is useful when there is a high degree of uncertainty or change and co-evolution is needed.

The cost of this mode is higher communication overhead and increased cognitive load – because teams will need to know more about the other teams.

Stream Aligned Teams will be the most likely to use this, probably with other Stream Aligned teams, and especially earlier in the life of a particular product when uncertainty is highest and boundaries are evolving.

The second mode is called X-as-a-Service.  A team wishing to use this interaction mode needs to consciously design and optimise it so that it serves both them and the other parties as effectively as possible.

Making this mode effective entails abstracting unnecessary detail from other teams and making the interface discoverable, self-documented and possibly an API.

The downside of this mode is that if it isn’t implemented effectively it may reduce the flow of other teams.  The interface also needs to stay relatively static in order to avoid consuming teams having to constantly relearn how to integrate.

As you’ve probably guessed this is a great model for a Platform Team to adopt – especially if they are highly re-used by many consumers.  It can be especially effective when platforms are well established and do not require rapid change or co-evolution of the API.

The final mode is called Facilitating.  This is the best mode for Enabling teams and describes how they can be helpful without being overly demanding.

All of the above modes are described in the book in much more detail than this, along with very actionable advice.

Continuous Evolution

The book has a section on static topologies, i.e. how teams may interact at a point in time.  However, it stresses the importance of sensing and adapting.  As the maturity of a team changes, collaboration modes and even team types may change or subdivide.

The Team Topologies book uses SRE as a good example of teams moving from Stream Aligned, to a function containing reliability specialists (who may or may not call themselves SREs) operating as an Enabling team, to an SRE team acting as a separate Ops team, and then back.  Obviously, there is a tension between making these changes to keep teams effective and the costs of changing teams and resetting back to Storming (as per Tuckman).

The book even offers advice for helping existing functional or traditional teams transition into the new model, for example: infrastructure teams to Platform teams, or support teams to Stream Aligned teams.

In the final part of this series, I will share some thoughts about how to use this information.

Team Topologies Book Summary – Part 1 of 3: Key Concepts

Team Topologies is one of the latest books published by IT Revolution (the excellent company created by Gene Kim, co-author of The Phoenix Project and author of The Unicorn Project).  Team Topologies was written by Matthew Skelton and Manuel Pais (the people behind the DevOps Topologies website), but it is far more than an ‘extended dance remix’ of that.  The book takes a much wider view of team designs that make companies successful and considers the full socio-technical problem space.

This blog series is a set of highlights from the book followed by some suggestions of what to do with the information.  I hope it will also inspire you to read the book and have some fresh thoughts about improving your organisation.

Conway’s Law demonstrates that team structures are incredibly important

Conway’s Law is the famous observation that the structure of teams (or more specifically the structure of information flow between groups of people) impacts the design of a system.  For example, if you take a 4-team organisation and ask them to build a compiler, you will most likely get a compiler that has 4 stages to it.  This is important because of the implication that the structure of teams doesn’t just impact governance, efficiency, and agility; it also impacts the actual architecture of the products that get built.  The architecture of products is vitally important because of the heavy bearing it has on the agility and reliability of not just systems, but the businesses driven by them.

So Conway’s law teaches us two things:

  1. Team designs are very important because good team designs lead to good software design and good software design leads to better more effective teams (or if you get it wrong the cycle goes the other way).
  2. The factors that influence both a good architecture as well as good teams need to be considered when designing teams, team boundaries, and the planned communication required between teams.

General guidance for great teams

The book doesn’t overlook covering general factors that influence high performing people and teams.  For example:

  • Team sizing – Dunbar’s number is highlighted for its recommendation about the limits of how many people can successfully collaborate in different ways:
    • The strongest form of communication happens in an individual team working closely together with a very consistent shared context and specific purpose.  In software this number is supposed to be 8 people.
    • The limit of people who can all deeply trust each other is 15, and there is also significance to the dynamics of 50 and 150 people.  So this can provide useful constraints when thinking about the design of teams of teams.
  • The importance of having long-lived teams that are given enough time to get to a state of high performance and then stick together and capitalise on it.  Research shows it can take 3 months for a team to reach the point where it is many times more effective than the sum of its individual members.  The better the overall company culture is, the easier changing teams can be, but even in the best cases the research recommends keeping teams together for no less than a year.  The book also highlights that the well-known Tuckman team performance model (Forming, Storming, Norming, Performing) has been proven to be less linear than it sounds: the Storming stage has been found to restart every time there is a personnel change or other major change to the team.
  • Some aspects of office design are important.
  • Putting other team members first and generally investing in relationships and making the team an inclusive place to work is vital.
  • Defining effective team scope, boundaries, and approaches to communication.  This is broadly the topic of the rest of the book (and this blog).

Team Design Consideration: Minimising Team Handoffs, Queues, and Communication Overhead

The book talks about the importance of organising for flow of change to software products.  In order to do that you need to consider team responsibilities and to minimise and optimise communication.  The book presents a case study from Amazon as exponents of “you build it, you run it” approaches.

You also need what the book calls “sensing” which is where an organisation possesses enough feedback mechanisms to ensure software and service quality is understood as early and clearly as possible.

Whilst teams may communicate and prioritise effectively within the boundary of their team, external communication across team boundaries is always much more costly and much less effective.  When teams have demands upon other teams, this can:

  • be disruptive and lead to context switching
  • lead to queues, delays, and prioritisation problems
  • create a large communication and management overhead.

I found the point about communication thought provoking because it’s often a popular idea that the more collaboration within an organisation, the better.  As the book states, in practice all communication is expensive and should be considered in terms of cost versus benefit.  A friend of mine, Tom Burgess, pointed out a nice similarity to the memory hierarchy in Computer Science.  This also got me thinking about the parallels between people and the Fallacies of Distributed Computing!

Team Design Consideration: Cognitive load

The book introduces a very helpful concept from psychology (created by John Sweller) called Cognitive Load.  This describes how much ‘thinking effort’ a particular ‘thing’ needs in order to be done effectively.  It observes that not all types of thinking effort are the same in nature:

  • Different tasks require different types of thinking
  • There are different causes of the need to think
  • There are different strategies for managing and reducing the effort required for the different types of thinking.

The types are:

  • Intrinsic Cognitive Load – this relates to the skill of how to do a task e.g. which buttons to press according to the specific information available and scenario.
  • Extraneous Cognitive Load – this relates to broader environmental knowledge, not related to the specific skill of the task, but still necessary, e.g. what are the surrounding process admin steps that must be done after this task.
  • Germane Cognitive Load – this relates to thinking about how to make the work as effective as possible e.g. what should the design be, how will things integrate.

The strategies for managing each type of cognitive load are as follows:

  • Intrinsic Cognitive Load – can be reduced by training people or finding people experienced at a task.  The greater their relevant skill levels, the lower the load required to do the task.
  • Extraneous Cognitive Load – can be reduced by automation or good process design and discoverability.
  • Germane Cognitive Load – is really the type of work you want people focusing on.  However, the amount required is a function of how much scope someone has to worry about.

This is all very interesting and applicable to individuals, but you might be wondering: how does this relate to organisational team design?  The book presents a very useful idea here: cognitive load should be considered in terms of the total amount required by whole teams.

Putting this in plainer terms, when you are thinking about team design and organisational structure, you need to consider how much collectively you are expecting the team to:

  • know
  • be able to perform
  • be able to make effective decisions about
  • be able to make brilliant.

The reason I found this so powerful is that it gives you a logical way to reason about the currently fairly fashionable idea that end-to-end / DevOps / BusDevSecOps(!) teams are the utopia (or worse: the meme that if there are separate teams in the value stream you are not doing DevOps and you are doing it wrong).  Sure, giving as much ownership of a product or service as possible to one team avoids team boundaries, but it also increases the cognitive load on the team and potentially the minimum number of people needed in the team.

Decomposing and fracture planes

So a simplified summary so far:

  • Amazon have helped highlight the benefit of minimising handoffs, queues, and communication
  • Sweller has taught us to avoid giving a team too much Cognitive load
  • Dunbar has taught us to keep teams at around 8 people.

At this point we have to consider the amount of scope assigned to a team as our way to satisfy the above constraints.  If we can keep it small enough, perhaps we can still give them end-to-end autonomy whilst keeping team sizes down and cognitive load manageable.

This is where the book starts talking about the concept of Fracture Planes.  The name is a metaphor for how stonemasons, when breaking rocks, focus on the natural characteristics of the structure of the rock in order to break it up cleanly and efficiently.  The theory is that software systems also naturally have characteristics that create more effective places to divide things up.  The metaphor is especially poetic considering large tightly coupled software systems are often likened to monolithic rocks.

The book provides a useful discussion of different types of fracture plane to explore including:

  • Business domains (i.e. where the whole domain-driven design and bounded context techniques come into play)
  • Change cadence i.e. things that need to change at the same pace
  • Risk i.e. things with similar risk profiles
  • (Most important of all) separation of platform and application (more on that to come).

Ideally all of these can help decompose systems whilst minimising cognitive load and keeping team sizes small enough.

In Part 2 I’ll share the reusable patterns the book proposes for doing this.

How my team do root cause analysis

This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process.  It is heavily inspired by lots of good writing about blameless postmortems for example from Etsy and the Beyond Blame book.  Hope you find it useful.

RCA Approach


This page describes a 7 step approach to performing RCAs.  The process belongs to all of us, so please feel free to update it.

Traditionally RCA stands for Root Cause Analysis.  However, there are two problems with this:

  1. It implies there is one root cause.  In practice there is often a cocktail of contributing causes, as well as negative (and sometimes positive) outcomes.
  2. The name implies that we are on a hunt for a cause.  We are on a hunt for causes, but only to help us identify preventative actions – not just to solve a mystery or, worse, to find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this?  i.e. they consciously carried out actions that they knew would cause this or something of similar negative consequences or they clearly understood the risks but cared so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions.  If it is “NO”, we can now proceed in a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answer is “YES”, that is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual either internal or external to our organisation for the incident that has triggered this exercise.  Our process has failed us and needs our collective input to improve it.  If at any point during the process anyone starts to doubt this statement or act like they no longer believe it we must return to Step 1.  Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident.  If anyone feels mistreated by this process, human nature means they will act in future to disguise their actions and limit blame, and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

  1. Facts must not be subjective.  If an assertion of fact cannot be 100% validated, we should agree and capture our confidence level (e.g. High, Medium, Low).  We must also capture the actions that we could take to validate it.
  2. If we don’t have enough facts, we will prioritise the facts that we need to go away and validate before reconvening to continue.  Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could take to validate them, and prioritise the discovery.
  3. If anyone feels uncomfortable during the process due to:
    1. Blame
    2. Concerns with the process
    3. Language or tones of voice
    4. Their ability to have their voice heard
    they must raise it immediately.
  4. We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion, attempt to reach a consensus over a statement that describes the incident that warranted this RCA.  This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects).  Don’t forget the impact on people, e.g. having to work late to fix something.  Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and identify and underline every aspect of the statement that someone could ask “Why” about.  Try to take an outsider’s view: even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question attempt to agree on one direct answer.  If you find you have more than one direct answer, split your “Why” question into several more specific “Why” questions, so that each answer correlates directly to one question.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches.  If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers.  If you can’t ask and answer more than 5 “Why”s per branch, possibly you are taking steps that are too large.

Do not stop this process with any branch ending on a statement that could be classified “human error”.  (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”.  Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement: if necessary, we should have implemented measures to protect ourselves from the third party.

Step 7: Form Countermeasure Hypothesis

Review the end points of your analysis tree and form hypotheses about actions that could be taken to prevent future re-occurrences.  Like all good hypotheses, these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented.  Use your normal approach to stating acceptance criteria, and don’t close the actions unless they pass the tests showing they have been effective.


Using ADOP and Docker to Learn Ansible

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.  Through the concept of cartridges (plugins) ADOP also makes it very easy to re-use automation.

In this blog I will describe an ADOP Cartridge that I created as an easy way to experiment with Ansible.  Of course there are many other ways of experimenting with Ansible, such as using Vagrant.  I chose to create an ADOP cartridge because ADOP is so easy to provision and so predictable.  If you have an ADOP instance running, you will be able to experience Ansible doing various interesting things in under 15 minutes.

To try this for yourself:

  1. Spin up an ADOP instance
  2. Load the Ansible 101 Cartridge (instructions)
  3. Run the jobs one-by-one and in each case read the console output.
  4. Re-run the jobs with different input parameters.

For anyone only loosely familiar with ADOP, Docker and Ansible, I recognise that this blog could be hard to follow, so here is a quick diagram of what is going on.

[Diagram: docker-ansible]

The Jenkins Jobs in the Cartridge

The jobs do the following things:

As the name suggests, the first job just demonstrates how to install Ansible on CentOS.  It installs Ansible in a Docker container in order to keep things simple and easy to clean up.  Having built a Docker image with Ansible installed, it tests the image just by running the following inside the container:

$ ansible --version

2_Run_Example_Adhoc_Commands

This job is a lot more interesting than the previous.  As the name suggests, the job is designed to run some adhoc Ansible commands (which is one of the first things you’ll do when learning Ansible).

Since the purpose of Ansible is infrastructure automation, we first need to set up an environment to run commands against.  My idea was to set up an environment of Docker containers pretending to be servers.  In real life I don’t think we would ever want Ansible configuring running Docker containers (we normally want Docker containers to be immutable and certainly don’t want them to have ssh access enabled).  However, I felt it was a quick way to get started and create something repeatable and disposable.

The environment created resembles the diagram above.  As you can see, we create two Docker containers (acting as servers) calling themselves web-node-1 and web-node-2, and one calling itself db-node.  The images already contain a public key (the same one Vagrant uses, actually) so that they can be ssh’d to (once again, not good practice with Docker containers, but needed so that we can treat them like servers and use Ansible).  We then use an image which we refer to as the Ansible Control Container.  We create this image by installing Ansible and adding an Ansible hosts (inventory) file that tells Ansible how to connect to the db and web “nodes” using the same key mentioned above.
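To make that concrete, here is a minimal sketch of what such an inventory could look like.  This is illustrative only: it is shown in Ansible’s YAML inventory format rather than whatever the cartridge actually uses, and the connection user and key path are assumptions.

# hosts.yml – an illustrative sketch, not the cartridge’s actual file
all:
  children:
    web:
      hosts:
        web-node-1:
        web-node-2:
    db:
      hosts:
        db-node:
  vars:
    ansible_user: root                                # assumed user
    ansible_ssh_private_key_file: /tmp/insecure_key   # assumed path to the Vagrant key

The web and db group names matter because they are what the ad hoc commands below target.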

With the environment in place the job runs the following ad hoc Ansible commands:

  1. ping all web nodes using the Ansible ping module: ansible web -m ping
  2. gather facts about the db node using the Ansible setup module: ansible db -m setup
  3. add a user to all web servers using the Ansible user module: ansible web -b -m user -a 'name=johnd comment="John Doe" uid=1040'

By running the job and reading the console output you can see Ansible in action and then update the job to learn more.

3_Run_Your_Adhoc_Command

This job is identical to the job above in terms of setting up an environment to run Ansible.  However, instead of having the hard-coded ad hoc Ansible commands listed above, it allows you to enter your own commands when running the job.  By default it pings all nodes:

ansible all -m ping

4_Run_A_Playbook

This job is identical to the jobs above in terms of setting up an environment to run Ansible.  However, instead of passing in an ad hoc Ansible command, it lets you pass in an Ansible playbook to run against the nodes.  By default, the playbook that gets run installs Apache on the web nodes and PostgreSQL on the db node.  Of course you can change this to run any playbook you like, so long as it is set to run on a host expression that matches: web-node-1, web-node-2, and/or db-node (or “all”).
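For a flavour of what such a playbook looks like, here is a minimal sketch (assuming CentOS-based nodes; the package names and module arguments are my assumptions, not the cartridge’s actual playbook):

# playbook.yml – an illustrative sketch
- hosts: web
  become: true
  tasks:
    - name: Install Apache on the web nodes
      yum:
        name: httpd
        state: present

- hosts: db
  become: true
  tasks:
    - name: Install PostgreSQL on the db node
      yum:
        name: postgresql-server
        state: present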

How the jobs 2-4 work

To understand exactly how jobs 2-4 work, the code is reasonably well commented and should be fairly readable.  However, at a high level the following steps are run:

  1. Create the Ansible inventory (hosts) file that our Ansible Control Container will need so that it can connect (ssh) to our db and web “nodes” to control them.
  2. Build the Docker image for our Ansible Control Container (installing Ansible as in the first Jenkins job, then adding the inventory file).
  3. Create a Docker network for our pretend server containers and our Ansible Control Container to all run on.
  4. Create a docker-compose file for our pretend servers environment (sketched below).
  5. Use docker-compose to create our pretend servers environment.
  6. Run the Ansible Control Container, mounting in the Jenkins workspace if we want to run a local playbook file, or otherwise just running the ad hoc Ansible command.
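As an illustration of steps 4 and 5, the generated docker-compose file could look roughly like this (the image name and network name are assumptions):

# docker-compose.yml – an illustrative sketch
version: '2'
services:
  web-node-1:
    image: pretend-server    # hypothetical image with sshd and the Vagrant public key baked in
    networks:
      - ansible-net
  web-node-2:
    image: pretend-server
    networks:
      - ansible-net
  db-node:
    image: pretend-server
    networks:
      - ansible-net
networks:
  ansible-net:
    external: true           # created in step 3 so the Ansible Control Container can join it too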

Conclusion

I hope this has been a useful read and has clarified a few things about Ansible, ADOP and Docker.  If you found this useful, please star the GitHub repo and/or share a pull request!

Bonus: here is an ADOP Platform Extension for Ansible Tower.

ADOP with Pivotal Cloud Foundry

As I have written here, the DevOps Platform (aka ADOP) is an integration of open source tools that is designed to provide the tooling capability required for Continuous Delivery.

In this blog I will describe integrating ADOP with the Cloud Foundry public PaaS from Pivotal.  Whilst it is of course technically possible to run all of the tools found in ADOP on Cloud Foundry, that wasn’t our intention.  Instead we wanted to combine the Continuous Delivery pipeline capabilities of ADOP with the industrial-grade, cloud-first environments that Cloud Foundry offers.

Many ADOP cartridges, for example the Java Petclinic one, contain two Continuous Delivery pipelines:

  • The first to build and test the infrastructure code and build the Platform Application
  • The second to build and test the application code and deploy it to an environment built on the Platform Application.

The beauty of using a public PaaS like Pivotal Cloud Foundry is that your platforms and environments are taken care of, leaving you much more time to focus on the application code.  However, you do of course still need to create an account and provision your environments:

  1. Register here
  2. Click Pivotal Web Services
  3. Create a free tier account
  4. Create an organisation
  5. Create one or more spaces

With this in place you are ready to:

  1. Spin up an ADOP instance
  2. Store your Cloud Foundry credentials in Jenkins’ Secure Store
  3. Load the Cloud Foundry Cartridge (instructions)
  4. Trigger the Continuous Delivery pipeline.

Having done all of this, the pipeline now does the following:

  1. Builds the code (which happens to be the JPetStore).
  2. Runs the unit tests and performs static code analysis using SonarQube.
  3. Deploys the code to an environment, known in Cloud Foundry as a Space (see the manifest sketch after this list).
  4. Performs functional testing using Selenium and some security testing using OWASP ZAP.
  5. Performs some performance testing using Gatling.
  6. Kills the running application in the environment and waits to verify that Cloud Foundry automatically restores it.
  7. Deploys the application to a multi-node Cloud Foundry environment.
  8. Kills one of the nodes in Cloud Foundry and validates that Cloud Foundry automatically avoids sending traffic to the killed node.
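To make step 3 a little more concrete: a Cloud Foundry deployment is typically driven by a manifest file passed to cf push, along the lines of this hypothetical minimal sketch (the application name, memory, instance count, and artifact path are all my assumptions):

# manifest.yml – an illustrative sketch
applications:
- name: jpetstore              # assumed application name
  memory: 1G
  instances: 2                 # more than one instance, so the platform can route around failures (steps 6 and 8)
  path: target/jpetstore.war   # assumed build artifact

The organisation and space being deployed into are selected beforehand with cf target -o <org> -s <space>.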

The beauty of ADOP is that all of this great Continuous Delivery automation is fully portable and can be loaded time and time again into any ADOP instance running on any cloud.

There is plenty more we could have done with the cartridge to really put the PaaS through its paces such as generating load and watching auto-scaling in action.  Everything is on Github, so pull requests will be warmly welcomed!  If you’ve tried to follow along but got stuck at all, please comment on this blog.

Join the DevOps Community Today!

As I’ve said in the past, if your organisation does not yet consider itself to be “doing DevOps” you should change that today.

If I was pushed to say the one thing I love most about the DevOps movement, it would be the sense of community and sharing.

I’ve never experienced anything like it previously in our industry.  It seems like everyone involved is united by being passionate about collaborating in as many ways as possible to improve:

  • the world through software
  • the rate at which we can do that
  • the lives of those working in our industry.

The barrier to entry to this community is extremely low, for example you can:

You could also consider attending the DevOps Enterprise Summit London (DOES).  It’s the third DOES event and the first ever in Europe, and it is highly likely to be one of the most important professional development things you do this year.  Organised by Gene Kim (co-author of The Phoenix Project) and IT Revolution, the conference is highly focused on bringing together anyone interested in DevOps and providing them with as much support as humanly possible in two days.  This involves presentations from some of the most advanced IT organisations in the world (aka unicorns), as well as many from those in traditional enterprises who may be on a very similar journey to you.  Already confirmed are talks from:

  • Rosalind Radcliffe, talking about doing DevOps with mainframe systems
  • Ron van Kemenade, CIO of ING Bank
  • Jason Cox, about doing DevOps transformation at Disney
  • Scott Potter, Head of New Engineering at News UK
  • And many more.

My recommendation is to get as many of your organisation along to the event as possible.  They won’t be disappointed.

Early bird tickets are available until 11th May 2016.

(Full disclosure – I’m a volunteer on the DOES London committee.)


Reducing Continuous Delivery Impedance – Part 5: Learned Helplessness

Nearly two years ago, I started this blog series to describe the main challenges I’d experienced trying to implement Continuous Delivery.  At the time, the last post in the series was about four challenges related to people.  Since then I’ve observed a fifth challenge and discovered it has been studied in psychology and has a name.

In this post I’ll attempt to describe how to recognise and tackle Learned Helplessness.  Please share your comments (especially if my Psychology-by-Wikipedia needs guidance).

Through various interactions with clients, at meetups, conferences and even with my own team, I’ve witnessed the following phenomena:

  • Something is done (or not done) on an engagement that makes Continuous Delivery difficult (for example, the development team accepting SonarQube saying some seriously defamatory things about their unit test coverage, but neglecting even to gradually address this).
  • When questioned:
    • many people already appreciate that this is very wrong.
    • hardly anyone can really explain or justify why this is happening.
    • hardly anyone seems worked up about a solution.

It gave me the impression that people had experienced good practice in the past, but having joined this particular engagement had somehow lost the inclination to follow it.  It’s possible that some people, when things just worked in the past, didn’t question why, and so never really appreciated the value of particular practices.  But I think most people are more analytical than that.  I started to realise that people had probably gone through an experience like this:

  • Joined the engagement, didn’t understand why certain things were / weren’t done, but opted to observe before speaking up.
  • Realised things actually weren’t magically working in some new logic- / experience- defying way.
  • Spoke up but didn’t really get listened to.
  • Spoke up again several times, but didn’t really ever get listened to.
  • Gave up and accepted things for the sorry way that they are.

I figured there must be a name for this, started googling, and realised it is called Learned Helplessness – something first demonstrated experimentally in the 1960s by some scientists we can probably assume weren’t dog lovers…

The experiments are best described here on Wikipedia but in extremely simplified form:

  1. some dogs were given no electric shocks,
  2. some dogs were given random shocks but were also given a button to press to disable the shocks,
  3. some dogs received shocks at the same time as group 2 dogs but had no button.  Group 3 dogs were paired with group 2 dogs and were shocked until their group 2 pair happened to press the button (which was at a random time from the group 3 dog’s perspective).

The learned helplessness of group 3 was demonstrated in the second part of the experiments, when the dogs had the opportunity to cross over a small wall to avoid getting shocks.  Whereas groups 1 and 2 quickly learned how to avoid the shocks, group 3 all failed to learn and sat there accepting their fate in pain.

http://wariscrime.com/new/wp-content/uploads/2015/02/seligman-2.jpg

The similarity of the above diagram to diagrams about DevOps like this made me smile!

Subsequent experiments demonstrated the ineffectiveness of threats or even rewards in motivating group 3 to change their location.  Only by physically teaching the group 3 dogs to move, more than twice, did they learn to overcome the helplessness.  Later experiments also demonstrated the same phenomenon in humans (without electricity).

So how do we overcome this?

Here are some things I’m experimenting with:

  • Try some introspection – ask yourself what you’ve learnt to accept; really look around for things that are stopping your project going faster, no matter how obvious, and start to ask why, perhaps at least 5 times.
  • Ask others around you – ideally at all levels of experience: less than, the same as, and more than your own – what they think is preventing learning and improvement, and consider asking “5 Whys” with them.
  • Pay close attention to new joiners to your team – they are the only ones not yet infected by Learned Helplessness.
  • Be sensitive with people.  No-one wants to be told they are “helpless” or hear your amateur psychobabble.  Tread carefully.
  • If you are looking to impart a change, don’t overestimate the impact of threatening or incentivising the people who need to change – they may already be too apathetic.  Instead, expect to need to show them multiple times:
    • That the proposed change is possible.  You need to demonstrate it to them (for example, if it relates to Continuous Delivery, something like the DevOps Platform may help make things real).
    • That their opinions count and they have an important voice.

How is Learned Helplessness harming your organisation and to what extent are you suffering?