Finding your path with Site Reliability Engineering (SRE)

In this blog I am excited to share a simple decision tree tool that I’ve developed with teams who are getting started with Site Reliability Engineering (SRE). I think it’s worth first understanding a bit of context, but if you prefer, feel free to dive right in here.

SRE is a set of principles and practices that can help organisations with the perpetual challenge of balancing changes to IT systems with the reliability, resilience, and operability of the corresponding production services.  The concepts build upon things that we have believed and practiced for a long time, but over the last two years many of our teams have taken fresh inspiration from adapting and applying ideas from SRE.

The SRE body of knowledge is broad and can support teams in considering topics as diverse as culture, major incident management, team topologies, observability, and architecture.  SRE can be useful as a team construct, a role, or simply a set of practices that any team can adopt.  Some teams find this diversity of content to be accessible and just what they need.  A lot of teams, however, just want to know:

“How do we actually get started with SRE, and what should we do today?” (many people)

In the past I tried distilling the bits I’ve found most differentiating about SRE into this self test script. I advocate the use of SRE to improve measurement and feedback loops for two important priorities:

  • Quality of service – within the bounds of the functional capability, is the service that the users/customers receive good enough?
  • Service Operability – are we happy with the impact that supporting this service has on our organisation and our colleagues?
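To make these two priorities measurable, here is a minimal sketch (the event counts and the 99.5% target are invented assumptions, not from any SRE standard) of how an SRE-style Service Level Indicator and error budget could be calculated:

```python
# Sketch of an SRE-style SLI/SLO calculation.
# The event counts and the 99.5% SLO target are illustrative assumptions.

def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of interactions that met the quality bar."""
    return good_events / total_events

def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget left (1.0 = untouched, < 0 = blown)."""
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure

sli = availability_sli(good_events=997_000, total_events=1_000_000)
print(f"SLI: {sli:.1%}")  # SLI: 99.7%
print(f"error budget remaining: {error_budget_remaining(sli, slo=0.995):.0%}")
```

Whether the indicator measures availability, latency, or something operability-related (like toil hours per week), the same arithmetic gives teams a feedback loop on both priorities.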

The decision tree follows this format and provides step by step suggestions on what to do next within your context. I also briefly tackle the financial motivations.

I hope you are now intrigued enough to click here and discover whether the tool helps you with your SRE journey. Please let me know if you find it useful or, even better, send a pull request on GitHub!

Defining an enterprise PaaS strategy – Part 3 of 3: Re-use versus Coupling

In part 1 I introduced the topic of building a PaaS strategy that is effective for major enterprises.  In part 2 I made a strong argument to buy as a service rather than self-manage.  Here I will share my thoughts on balancing re-use and sharing of PaaS instances versus the potentially hidden onward impact on agility and safety.

Number of PaaS Instances and PaaS Fallibility

There is a very easy trap to fall into with PaaS which I admit to having been caught by in the past – believing that all of the clever resilience makes it overall infallible.  After all, we know PaaS’ are built to handle scheduling, re-scheduling, and scaling of compute workloads.  We know they are built without single points of failure in critical components.  We know they have redundancy in storage, probably at multiple logical and physical levels.  But we also mustn’t forget this: they are complex software and humans are involved.  For all the clever “magic” in a PaaS, they can have downtime, and that can be in a very ungraceful “nothing is working” and possibly even “a part of the PaaS you’ve never even heard of is complaining” manner.  If you have a PaaS instance, you have to expect that at some point it will be 100% down, taking every last micro-service with it.

The first implication of this unfortunate truth is that you need an approach to redundancy at the PaaS level.  If availability really matters, you are likely going to need more than one PaaS instance (and that is definitely not the same as one PaaS instance spread across multiple availability zones or even regions).  You are also going to need an approach to fast restoration of service.  For example, if you have 2 PaaS instances and one goes down, you need to be able to rebuild the failed instance before you lose the only one you have left.
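The arithmetic behind this is worth sketching (the 99.5% figure is invented, and the independence assumption is generous – shared dependencies like a cloud region or a provisioning pipeline can take out both instances at once):

```python
# Illustrative arithmetic only: the per-instance availability figure is an
# invented assumption, and this treats instance failures as independent,
# which shared dependencies (region, DNS, config pipeline) can violate.

def combined_availability(per_instance: float, instances: int) -> float:
    """Availability if the service survives while any one instance is up."""
    return 1.0 - (1.0 - per_instance) ** instances

print(f"{combined_availability(0.995, 1):.3%}")  # 99.500%
print(f"{combined_availability(0.995, 2):.4%}")  # 99.9975%
```

The corollary follows: the second instance only provides that redundancy while it exists, so rebuilding a failed instance quickly (a low time to restore) is what keeps this maths honest.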

As an aside, a quick thank you to Michael Nygard for his excellent explanations of coupling which really helped me understand the topic more clearly.

Failure-state coupling

In virtualisation, where you have a hypervisor turning a physical machine into multiple logical virtual machines, there is a well-established concept of a “noisy neighbour”.  This is where one virtual machine starts taking more than its fair, or perhaps even allocated, resources and starts starving and impacting other local virtual machines.  Perhaps the noisy neighbour VM has a memory leak and is becoming unresponsive.  There is a danger that this failure effectively jumps to other VMs through them slowing down, and before you know it, you have cascading failure.  I think of this as failure-state coupling.

There is exactly the same possible problem with a PaaS instance.  The implication of this is very simple: you have to think about which services you actually want sharing a PaaS instance.  For example, your enterprise has a mortgage advisor virtual agent application and it has a payments platform.  Do you ever want to be in the situation where your mortgage advisor agent takes down a shared PaaS and you lose your payments platform?  This concept is called isolation.  It’s well established in traditional data centres and usually achieved at multiple physical, network, and logical levels.  Don’t throw that wisdom out through PaaS exuberance.

So the question of how many PaaS instances has to be answered.  The decision needs to factor in both redundancy and isolation.  These are very likely to lead you to wanting a handful of PaaS instances or perhaps more (even before you factor in instances for hosting test environments).  It’s a shame that the vision of one highly leveraged and centralised strategic platform instance doesn’t survive reality, but run-time coupling can be too unpalatable.


So we’ve accepted that we need multiple PaaS instances.  At least we can have a wonderfully industrialised PaaS factory, centre of excellence, PaaS instance self-service, PaaS instance SRE team, right?  Unfortunately, despite the promise of meeting traditional values of industrialisation, re-use, and cost cutting, there is a downside which I like to think of as dev-time coupling.  If we create one enterprise team responsible for creating and operating standardised and hardened PaaS instances, we create a dependency upon them and therefore the possibility of queues and teams blocking other teams.  Even if the team is just responsible for taking PaaS and serverless services from the public cloud, testing them, and adding some standardised provisioning and configuration code (Terraform and Ansible for example), you have created team coupling.  You have created the very real chance that one part of your organisation desperately wants to use a new PaaS service from AWS and they can’t get it for 6 months because it isn’t high enough on the centralised PaaS factory’s agenda.  (Perhaps even when all the PaaS factory would be required to do is bless it and make some IAM changes.)

When creating a PaaS strategy you need to think hard about how much development efficiency benefit having all teams use consistent PaaS instances is really going to bring versus the potential business impact of lost autonomy and agility.  In my opinion you might as well face up to having multiple different PaaS solutions in different teams and give up trying to make one uber-consistent version everywhere.  But what about the wild west and sprawl, I hear you asking?  For me this is where we can turn to Conway’s Law (and hopefully the use we’ve already made of it).  If an organisation has split into business function aligned, more autonomous end-to-end groups, those boundaries work here.  Firstly, I think those domains should still be given agency to make decisions about their PaaS implementation (not to say they can’t use that to opt for re-use of other implementations if appropriate).  Secondly, they might be an appropriate boundary for internal consistency, i.e. they accept dev-time coupling within their own group.  When organisations have taken the steps of creating governance structures like this, it can be a mistake to undermine them through an inflicted PaaS implementation.

Operation-time coupling

Even if we reject the idea that it’s fine to have multiple PaaS implementations and we are resolute on everyone having a consistent version, it is inevitable that you will end up running almost as many different versions of your consistent PaaS implementation in production as you have PaaS instances.

Firstly, I need to deliver one final home truth about PaaS: not only are they hard to upgrade, but you also need to upgrade them very regularly.  And this doesn’t just mean nested internal software that you don’t notice; often these are breaking changes affecting the API and contract between the PaaS and your code.

The idea of developing a new version of your PaaS implementation (perhaps in response to one team wanting new features) and then rolling it out big bang to every instance is very naïve.  Think about upgrading just one instance that, say, runs 100 micro-services: each upgrade exercise has the possibility of impacting every single one, conceivably in different ways.  That is a lot of testing to co-ordinate and definitely isn’t something you want to make wider in scope than necessary.  Let’s say you have 5 PaaS instances across your enterprise for different business aligned groups; are you really going to update them all in the same day / week / even month?  You’ll be lucky not to have 5 different versions all running in production.  I am not defending avoiding upgrades and drifting towards running out of support cover and increasing security vulnerability, but it’s hard enough co-ordinating for 1 team let alone 5.


So it’s a wrap: what I think an enterprise PaaS strategy should do, where to get PaaS’ from, and then here in part three – how to consider the true cost of re-use versus coupling.

Defining an enterprise PaaS strategy – Part 2 of 3: As a Service

In part 1 I introduced the topic of building a PaaS strategy that is effective for major enterprises.  Here I will justify my recommendation for buying as a service over building and running PaaS’ yourself.

Why As-a-service?

A long-time mantra of the clouderati (and something we no doubt borrowed from another field) has been “don’t do undifferentiated heavy lifting”.  This means if something is hard and there are diminishing returns in terms of how well you can do it in comparison to your competitors, you should look to receive it as a service.  Obviously when procuring something as a service, the service needs to be at least as good as you can do it yourself.  By “good” I essentially mean price, quality, and overall effectiveness.

In IaaS this means: can you meet your own compute, network, and storage requirements better than a public cloud provider can?  “Better” means cost, variety, scalability, reliability, security, etc.  For PaaS, all of the above logic applies: does how well you can create a platform to run your software applications present any significant advantage over what your competitors could either do themselves or purchase as a service?

As of June 2019, AWS’s managed Kubernetes PaaS service is generally available, joining Azure and GCP (and many others) with competing offerings.  Whilst these offerings are still relatively new, they are improving daily.

Without a doubt, installing and running a PaaS application (be that OpenShift, Kubernetes, Cloud Foundry, etc.) is heavy lifting.  You are undertaking operation of something which is orders of magnitude more complicated than what it takes to run individual micro-services.  You are running a distributed and dynamic multi-tier piece of software that contains tens of components and millions of lines of code written in over 10 languages – oh, and it’s open source.

Can your PaaS’ be a source of business differentiation compared to PaaS instances you can get from a cloud provider?  I doubt it.

Can you grow skills in house to do this that can compete with cloud providers doing the same thing at hyper scale?  I’m afraid not.

No longer a greenfield?

But alas the world is not quite so simple and greenfield, and even with acceptance of my opinions above, a number of reasons compel organisations to run their own PaaS’.  First the ones I disagree with:

  1. Unwillingness to use public cloud at all.
  2. Unwillingness to use anything above IaaS from a public cloud provider for fear of:
    1. Lock-in
    2. Additional shared security responsibility
    3. Distrust about their ability to make it reliable.
  3. Unwillingness to use Open Source.

And ones that I accept as unfortunate but reasonable:

  1. They started before public PaaS’ (at least from their preferred cloud provider) were fully available.  Migrating is hard to create a business case for (even ignoring any sunk cost fallacies).
  2. They have a lot of infrastructure in a private data centre (on prem) and the benefits of repurposing that to host a PaaS outweigh the costs of implementing and running a PaaS on it.
  3. They have a large estate on prem, a small planned workload in the cloud and network traffic between the two is prohibitive in terms of cost, or latency (or perceived security).

So to summarise: your enterprise PaaS strategy should have a view on how much of your PaaS estate (in terms of your different instances, and how much of the stack each instance is composed of) you will buy as a service and how much you will create and run yourself (and how you might shift that balance towards buying over time).

In part 3 I will cover the trade-off between re-use and coupling.


Defining an enterprise PaaS strategy – Part 1 of 3: Introduction

Large enterprises need platform teams and platform applications (aka platform services or PaaS).  In this blog series I will explore how to formulate a PaaS strategy and the key decisions around: high level strategy, where to get your PaaS’ from, how many you need, and how to manage them.

Why PaaS?

It is widely believed that custom software is a powerful solution for underpinning differentiated business functions.  It is also a popular idea that building software as very small (“micro”), composable, network-invoked services leads to more agile, scalable, and reliable solutions.  This presents IT functions with a challenging requirement that may be relatively new to them: what is the most effective way to run all of this new dynamic and granular custom code?  Fortunately there is an aaS for that in Platform as a Service (PaaS).

As I’ve pointed out in the past, things called as-a-service are different things according to who is providing the service to whom.  I find it clearer to think of platforms as being a product.  These products are themselves a software application that shields you from the intricacies of infrastructure and presents you with a suitable place to deploy and operate your software services.  Done well, they promise a perfect balance of reliability, security, safety, usability, and re-use.

Traditional / large enterprises now need strategies for governing their use of such software hosting platforms and this is what I want to opine on.

High level strategy

I believe enterprises should:

  • Give up on the idea of one consistent PaaS implementation, let alone one instance.  Instead they need strategies to cope with heterogeneous PaaS solutions (both in terms of implementations and running instances) and to become comfortable and effective at owning them.
  • Buy PaaS instances as a service from existing cloud providers.
  • Practice lots of PaaS disaster recovery (DR) scenarios until each PaaS instance meets the required time to restore (TTR) targets.
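As a sketch of that last DR point (the instance names, the target, and the drill timings below are all placeholders I have invented), drill results can be tracked per instance against the TTR target:

```python
# Sketch: tracking disaster-recovery drill results per PaaS instance
# against a time-to-restore (TTR) target.  All names and numbers are
# placeholder assumptions.

TTR_TARGET_MINUTES = 60  # hypothetical restore target

drill_minutes = {
    "paas-payments": [95, 70, 55],  # restore times, most recent drill last
    "paas-channels": [50, 45, 40],
}

for instance, times in drill_minutes.items():
    latest = times[-1]
    trend = "improving" if latest < times[0] else "flat/worse"
    print(f"{instance}: latest={latest}m "
          f"target_met={latest <= TTR_TARGET_MINUTES} trend={trend}")
```

The point is simply that practising until the latest drill reliably beats the target turns "we have DR" from an assertion into a measurement.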

They need a strategy that empowers teams to use these amazing and now abundant technologies.  The number one priority needs to be safety followed closely by enablement of autonomy and agility.  Everything else needs to fall in line.

In part 2 I will talk about buying as a service versus self managed. Finally I’ll cover the trade off of re-use and sharing versus coupling in part 3.

The Importance of Developer Experience in the PaaS Age

Platforms make or break software applications (literally) and they can make or break whole organisations.  Platform teams can also be a highly successful team topology as I’ve covered here.

At this time of rapid advancement in public cloud platform services, anyone with a credit card has ready access to powerful capabilities.  This erodes the advantage that established organisations have over new market entrants.  Every day platform capabilities (like container scheduling, caches, GPU farms, etc.) become more commoditised.  Every day integrating these technologies and delivering them internally as effective platform services becomes more important.

I believe large enterprises need to pivot their existing platforms to meet modern standards and this doesn’t just mean technical capability, but also usability.  The term Developer Experience (DX) can be defined as:

the equivalent of User Experience when the primary user of the product is a developer. DX cares about the developer experience of using a product, its libs, SDKs, documentation, frameworks, open-source solutions, general tools, APIs, etc.

It is nearly 10 years old as a concept with its origins in tech companies looking to get third party developers to use their APIs.  Lots of things impact DX as described excellently here but it is the factors related to platforms (and what to do about them) that I want to focus on.

Key Factors Impacting DX when using a platform

Whilst developer experience around a platform is also a function of things like the platform architecture, the API, and documentation, the following factors are the ones I believe matter most.

Minimise Extraneous Cognitive Load  

Psychologist John Sweller created the concept of Cognitive Load, which characterises how much ‘thinking effort’ a particular ‘thing’ needs in order to be done effectively.  A good platform will enable developers to devote their brain energy to germane cognitive load, which is essentially the creative part of performing their work.  It will minimise extraneous cognitive load, which means effort spent thinking about things like how to get code compiled or deployed.  A platform with challenges in this area may place high demands of extraneous cognitive load, leading developers to waste more effort thinking about how to use the platform than about what their code should be doing on it.

Maximise The Opportunity for Flow

Another important concept relevant to developer experience is Flow, as defined by psychologist Mihaly Csikszentmihalyi.  In a Flow state of working, people are completely absorbed in a task and hopefully both productive and at their happiest.  A good platform should enable a workflow that lets developers focus without being interrupted by their tools.  This means the time they spend waiting for computer tasks to complete – like committing code, compiling, or getting feedback from tests – needs to be absolutely minimised.  A platform with challenges may hold developers up so badly that it is impossible for them to work in a Flow-like state.

Quality of Platform Service

The performance and consistency of the platform, including tools and test environments, is of vital importance to developer effectiveness.  People very quickly become accustomed to a poor user experience, and at that point learned helplessness kicks in: they suffer in silence, and it becomes harder to understand how the platform may be negatively affecting their productivity and engagement.

Improving the platform’s impact on DX

Here are my thoughts on how to drive improvements.

Housing the challenge of improving DX

The challenge of creating effective DX involves factors beyond the scope of any one team.  Factors include the platform architecture and API, skills of the engineering team, other common platform services etc.  Naming a team after DX could be taken to imply that the team are solely responsible for it.  DX should be promoted as something achieved by multiple teams working both individually and together.  That said it can be effective to implement a team to own these complementary but different roles:

  1. Providing some tools as services to developers and as part of the overall platform API.
  2. Helping teams improve their delivery effectiveness through:
    • studying the DX of engineering teams, 
    • involving the right parties, 
    • driving experiments to learn how to improve it.

When focusing on this, it can be helpful to seek input from general (as in not just platform) UX specialists, and also (if you have them) any public API teams and DevRel people.

Studying Developer Experience

A collaborative approach to improving DX can benefit from:

  • Cross team terminology for describing it.
  • Common DX metrics for measuring it.
  • Joint initiatives for learning how to improve it.

This is of course reliant on a psychologically safe culture where teams are comfortable sharing their challenges and failures with each other.

It would be valuable to understand more about the current state of DX at an organisation.  This could include:

  • Tracking and sharing internally a baseline of above mentioned DX metrics.
  • Working closely with a rotating VIP engineering team.  (Has the side benefit that they can become advocates.)
  • Analysing Request, Incident and Problem Management of all tickets created by developers and related to the platform.
  • Reviewing metrics around the reliability of the tools (using Service Level Indicators from SRE).
  • Creating personas for engineering team members (considering things like the technologies they use, their proficiency, their team release cadences).
  • Gathering more user analytics directly from the tools e.g. compile times, invalid commands attempted by users.
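To illustrate the SLI idea applied to developer tooling (the build data and the 10-minute threshold here are assumptions of mine, not from any standard):

```python
# Sketch: a crude Service Level Indicator for a developer tool, e.g.
# "share of CI builds that succeeded within 10 minutes".
# The build data and the threshold are illustrative assumptions.

builds = [  # (duration_minutes, succeeded)
    (4, True), (7, True), (12, True), (6, False), (5, True),
]

FAST_MINUTES = 10

good = sum(1 for minutes, ok in builds if ok and minutes <= FAST_MINUTES)
sli = good / len(builds)
print(f"fast-and-successful build SLI: {sli:.0%}")  # 60%
```

Baselining even one or two indicators like this makes the "current state of DX" discussion concrete rather than anecdotal.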

Experimenting and Learning

With the above foundation of an improved collective understanding of DX, we can try to make changes to drive improvements.  Hopefully the insights will lead to some quick wins that can be implemented immediately.

An approach to more involved change could include:

  • Creating a value framework to help create a consensus about how we can measure and demonstrate improvement. 
  • Correlating our values to the DX metrics.
  • Ensuring we can baseline performance against the metrics.
  • Creating a hypothesis about how a change may improve one of the metrics we care about and hence deliver value.
  • Refining the hypothesis to create the lowest cost experiment that shows the fastest results.
  • Prioritising hypotheses using expected value divided by estimated effort.
  • Performing experiments and iteratively driving improvements.
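The prioritisation step above can be as simple as this sketch (the hypotheses and their 1-10 scores are invented placeholders):

```python
# Sketch: ranking improvement hypotheses by expected value per unit effort.
# The hypotheses and the scores are invented placeholders.

hypotheses = [
    {"name": "cache dependencies in CI", "value": 8, "effort": 2},
    {"name": "self-service test environments", "value": 9, "effort": 6},
    {"name": "faster local builds", "value": 5, "effort": 5},
]

# Highest value-per-effort first.
ranked = sorted(hypotheses, key=lambda h: h["value"] / h["effort"], reverse=True)

for h in ranked:
    print(f'{h["name"]}: score={h["value"] / h["effort"]:.1f}')
```

Crude as it is, a shared scoring rule like this makes the prioritisation debate about the inputs (value and effort estimates) rather than about opinions.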


I hope this has got you thinking about the platform parts of DX.  I would love to hear your own ideas about how to improve them.

Team Topologies Book Summary – Part 3 of 3: Taking Action

In part 1 of this 3 part blog series about the Team Topologies book, I summarised a large set of general things that can help make teams successful. In part 2 I covered the book’s ideas on team design and team interaction modes.  What an epic so far, and this has just been my summary of parts of the book with much less detail and specific advice.  In this final blog of the series I will leave you with some things to consider doing with all of this excellent information.

From the general guidance for effective teams, consider:

  • Are teams too large?
  • Do they have the right working environment?
  • Do they treat other team members well enough?
  • Are teams being allowed to stay together long enough?

Using the concept of Cognitive load, consider:

  • Do teams have the right skills to handle the intrinsic cognitive load expected of them?
  • Do teams use enough automation to minimise extraneous cognitive load?
  • Do teams have effective interaction modes with other teams to minimise extraneous cognitive load?
  • Do teams have a manageable amount of scope (germane cognitive load) or should they consider exploiting a fracture plane to divide up?
  • Could SRE processes be adopted to keep cognitive load down by continuing to divide it between a Dev and an Ops (SRE) team, but also to improve overall alignment, resilience, and agility?

The terminology for team types and interaction types in the book is extremely helpful when thinking about the teams in your organisation.

Using the topologies, consider:

  • Which types do our existing teams align to?
  • If our teams do align to a type, are they following the recipes for success in the book or falling into the anti-patterns?
  • If our teams do not align to a type in the book, should they be altered so that they do?  Or do we really want to customise the types for our needs?
  • Are our teams effectively using the interaction types recommended in the book, or if not, are they communicating too much, and generating too much cognitive load?

If existing team structures do not fit the advice of this book and you see room for improvement, what factors have influenced them to become as they are today (e.g. budgeting / finance models)?  Are these things easily surmountable?

I hope this has all been useful and you are now inspired to read the full book, to get much deeper into this vital topic, and to start applying it to your own organisations!

I think the authors Matthew and Manuel did an awesome job and would like to thank them again for writing it.

Team Topologies Book Summary – Part 2 of 3: Topologies and Interaction Modes

In part 1 of this 3 part blog series about the Team Topologies book, I summarised a large set of general things that can help make teams successful.  In this part I will get to actual team design, i.e. types of team and how they should interact.

The Four Team Types

Based on their extensive research, the authors present 4 types of team and make an important assertion: to be an effective organisation you only need a mixture of teams conforming to these 4 types.

Personally, I really identified with the 4 types, but even if you have different opinions, it is still very useful to be able to reference and extend / update the common set of types described (in such great detail) in the book.

The 4 types are as follows.

Stream Aligned Teams

This is the name the authors coined for an end-to-end team that owns the full value stream for delivering changes to software products or services AND operating them.  This is the stereotypical modern and popular “you build it, you run it” team.

For Stream Aligned teams to be able to do their job, they need to be comprised of team members that collectively have all of the different skills they need e.g. business knowledge, architecture, development and testing skills, UX, security and compliance knowledge, monitoring and metrics skills etc.  This enables them to play nicely with the requirement of minimising queues and handoffs to other teams (especially in comparison to teams comprised of people performing singular functions e.g. testing).

The expectation is that by teams owning their whole value stream including the performance of the system in production, they can optimise for rapid small batch size changes, and reap all of the expected benefits around both agility and safety.  The hope is also that this end to end scope might help teams achieve “autonomy, mastery and purpose” (things which Daniel Pink highlights as most important to knowledge workers).

The amount of cognitive load required of them is a factor of a few things:

  • If they have enough of the requisite skills in their team, intrinsic cognitive load will be manageable.
  • If the modes of interaction with other teams are clean and efficient, their extraneous cognitive load shouldn’t be too high.  They can also minimise this within their team through automation.
  • If they were formed using fracture planes effectively enough that they can keep their scope (and therefore their germane cognitive load) to a manageable amount.

Overall this seems like a very compelling team type and indeed the book recommends that the majority of teams in organisations have this form.

The book goes into a lot more detail about how to make a Stream Aligned Team effective which I highly recommend reading.

If the book only included Stream Aligned Teams, I would have considered it a cynical attempt to document fashionable ideas and ignore the factors that have led many people to have other team types.  Fortunately, instead they did a great job at considering the whole picture via 3 other types (plus bonus type SRE – read on!)

Platform Teams

In a tech start-up that begins with just one Stream Aligned Team owning the full stack and software lifecycle end-to-end, at some point they will face into Dunbar’s Number and cognitive overload.  It will be time to look for a fracture plane to enable splitting into at least two teams.  Separating the platform from the application is a very successful and well-established pattern.  At this point we could potentially have one Stream Aligned Team working on the application and one on the platform.

Let’s say demand for application complexity continues to increase and the application team splits into two teams focussed on different application value streams.  We’re now in a situation where they both most likely re-use the same platform team and we get the benefit of that re-use.  This could of course repeat many times.  The authors recognised that the nature of running a platform team (especially in terms of the type of coupling to other teams and effective modes of communication) differed enough to warrant a new team type, and this they called the Platform Team.

Platform Teams create and operate something re-usable for meeting one or more application hosting requirements.  These could be running platform applications like Kubernetes, a Content Management System as a service, IaaS, or even teams wrapping third party as-a-service services like a Relational Database Service.  They may also be layering on additional features valuable to the consumers of the platform such as security and compliance scanning.

But there are potential traps with platform teams.  If the platform is too opinionated and, worse, its usage is also mandated within an organisation, the impedance mismatch may do more harm than good for consuming applications.  If a platform isn’t observable and applications are deployed into a black box, the application teams will be disconnected from the detail of how their application is performing in production and disempowered to make a difference to it.  The book goes into a lot of detail about how to avoid creating bad platform teams and strongly makes an argument for keeping platforms as thin as possible (which it calls the minimum viable platform).

Personally, I think the dynamic that occurs when consuming a platform from a third party is very powerful for a number of reasons:

  • The platform provider is under commercial pressure to create a platform good enough that consumers pay to use it.
  • The platform provider has to be able to deliver the service at a cost point below the total revenue its users will give it (so it will be efficient and choose its offered services carefully).

If organisations can at least try to think like that (with or without internal use of “wooden dollars”) I think they will create an effective dynamic.  The book mentions that Don Reinertsen recommends internal pricing as a mechanism for avoiding platform consumers demanding things they don’t need.

Enabling Teams

The next type of team is designed to serve other teams to help them improve.  I think it is great that the book acknowledges and explores these as they are very common and often a good thing.  Enabling teams should contain experienced people excited about finding ways to help other teams improve.  To some extent they can be thought of as technical consultants with a remit for helping drive improvement.

Most important is how well Enabling teams engage with other teams.  There are various traps such a team can fall into and must avoid:

  • Becoming an ivory tower defining processes, policies, perhaps even technical ‘decisions’ and inflicting them upon the teams that they are supposed to be helping.
  • Generally being disruptive by causing things like interruptions, context switching, cognitive load, and communication overhead – especially if this cost outweighs the benefit.

As with all other types, the book provides detailed expected behaviours and success criteria for enabling teams, such as understanding the needs of the teams they support and taking an experimental approach to meeting those needs.  It’s also important that enabling teams bring in a wider perspective, for example awareness of technology advances and industry technology and process trends.  Enabling teams may be long lived, but their engagement with the teams they are supporting should probably not be permanent.

Complicated-Subsystem Teams

Finally, the book defines a 4th type of team called a complicated-subsystem team.  Essentially these teams own the development and maintenance of a complicated subcomponent that is probably consumed as code or a binary, rather than at runtime as a service over a network.  The concept is that the component requires specialist knowledge to build and change, and that this can be done most effectively by a dedicated team that doesn’t also carry the cognitive load of consuming and deploying the component.

Other types of teams: SRE teams

The book does acknowledge some other team types outside of the main 4, for example, SRE teams.

Without getting into too much detail, Site Reliability Engineering (SRE) is an approach to Operations that Google developed and started sharing with the world a couple of years ago.  The thing that is most interesting about it from a team design perspective is that an SRE team is, in effect, a traditional separate operations team.  In this regard it doesn’t really fit the 4 topologies above.  I think there are a couple of reasons the SRE model is successful:

  1. At least at Google, the default mode of operation is Stream Aligned teams.  An SRE team will only operate an application if it is demonstrably reliable and doesn’t require a lot of manual effort to operate.
  2. It promotes use of some conceptually simple but effective metrics to ensure the application meets the standards for an SRE to operate it.

The book says that if your systems are reliable and operable enough you can use SRE teams.  Doing so will of course reduce some cognitive load on the Stream Aligned team, which then has to pay less attention to the demands of an application in production.  This is actually very interesting because in some ways a Stream Aligned team handing over applications to an SRE team to operate is very similar to a traditional Dev and Ops split.

So the message, in a way, becomes: if your processes are mature enough and your engineering good enough, traditional ways will suffice; BUT the problem is they probably aren’t, and instead you need to use Stream Aligned teams until you get there.  I think this leads to an alternative option for organisations – directly implement the measurement of SRE and focus on quality until you can make your existing separate Dev and Ops teams perform well enough.  I even suggest this justifies a first-class fifth team type.
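To make the SRE handover idea concrete, here is a minimal sketch of gating the decision to let a separate SRE/Ops team operate a service on measured reliability and manual-effort data rather than organisational convention.  The function name, thresholds, and metrics below are hypothetical illustrations, not taken from the book or from Google’s practice:

```python
# Hypothetical gate for handing a service from a Stream Aligned team to a
# separate SRE/Ops team: the service must demonstrably meet its SLO and
# must not demand too much manual (toil) effort.  All thresholds are examples.

def ready_for_sre_handover(sli_achieved: float,
                           slo_target: float,
                           toil_hours_per_week: float,
                           max_toil_hours: float = 10.0) -> bool:
    """Return True only if the service meets its SLO and toil is bounded."""
    return sli_achieved >= slo_target and toil_hours_per_week <= max_toil_hours

# A service achieving 99.95% against a 99.9% SLO with 4h/week of toil
# qualifies; one missing its SLO does not, however little toil it generates.
print(ready_for_sre_handover(0.9995, 0.999, 4.0))   # True
print(ready_for_sre_handover(0.9980, 0.999, 1.0))   # False
```

The point of the sketch is that the gate is symmetric: if the service degrades, it fails the same check and operation hands back to the Stream Aligned team.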

Other types of teams: Service Experience

The book also acknowledges that many business models include call centre teams and service desk teams who engage with customers directly.  It proposes keeping these teams separate from the team types above, while ensuring information about the operation of the live system makes it back to the Stream Aligned teams.  It also talks about the concept of grouping and tightly aligning Service Experience Teams with Stream Aligned teams.  The Service Experience Teams are named this way to emphasise their customer orientation.

A key success criterion is the relationship to Stream Aligned teams, which must be close and ideally 1 to 1.  I think this is an excellent point: in many organisations, the 1-to-many (overly shared) relationships between teams that ought to be closely aligned cause as many problems as the team boundaries themselves.  For example, if an application support team is spread across too many applications, it may achieve some efficiencies of scale, but the overall service delivered to each supported team will be far less effective than if the team were subdivided into smaller, more closely aligned teams.

If you spotted other team types in the book that were acknowledged as acceptable – let me know.

The Three Modes of Team Interaction

The book defines 3 interaction modes and recommends which team types should use each one.  By interactions the book means how and when teams communicate and collaborate with each other.  Central to this is the idea that communication is expensive and erodes team structures, so it should be used wisely.

The first interaction mode the book defines is called Collaboration.  This is a multi-directional, regular, and close interaction.  It is fast, often real time, and therefore responsive and agile.  It is the mode that enables teams to work most closely together and is useful when there is a high degree of uncertainty or change and co-evolution is needed.

The cost of this mode is higher communication overhead and increased cognitive load – because teams will need to know more about the other teams.

Stream Aligned Teams will be the most likely to use this, probably with other Stream Aligned teams and especially earlier in the life of a particular product when uncertainty is highest, and boundaries are evolving.

The second mode is called X-as-a-Service.  A team wishing to use this interaction mode needs to consciously design and optimise it so that it serves both them and the other parties as effectively as possible.

Making this mode effective entails abstracting unnecessary detail from other teams and making the interface discoverable, self-documented and possibly an API.

The downside of this mode is that if it isn’t implemented effectively it may reduce the flow of other teams.  It also needs to stay relatively static in order to avoid consuming teams having to constantly relearn how to integrate.

As you’ve probably guessed this is a great model for a Platform Team to adopt – especially if they are highly re-used by many consumers.  It can be especially effective when platforms are well established and do not require rapid change or co-evolution of the API.

The final mode is called Facilitating.   This is the best mode for Enabling teams and describes how enabling teams can be helpful without being over demanding.

All of the above modes are described in much more detail than this, and the book offers very actionable advice.

Continuous Evolution

The book has a section on static topologies, i.e. how teams may interact at a point in time.  However, it stresses the importance of sensing and adapting.  As the maturity of a team changes, collaboration modes and even team types may change or subdivide.

The Team Topologies book uses SRE as a good example of teams moving from Stream Aligned, to a function containing reliability specialists (who may or may not call themselves SREs) operating as an enabling team, to an SRE team acting as a separate Ops team, and then back.  Obviously, there is a tension between making these changes to keep teams effective and the costs of changing teams and resetting back to Storming (as per Tuckman).

The book even offers advice for helping existing functional or traditional teams transition into the new model, for example: infrastructure teams to Platform teams, support teams to Stream Aligned teams.

In the final part of this series, I will share some thoughts about how to use this information.

Team Topologies Book Summary – Part 1 of 3: Key Concepts

Team Topologies is one of the latest books published by IT Revolution (the excellent company created by Gene Kim, co-author of The Phoenix Project and author of The Unicorn Project).  Team Topologies was written by Matthew Skelton and Manuel Pais (the people behind the DevOps Topologies website), but it is far more than an ‘extended dance remix’ of that.  The book takes a much wider view on the team designs that make companies successful and considers the full socio-technical problem space.

This blog series is a set of highlights from the book followed by some suggestions of what to do with the information.  I hope it will also inspire you to read the book and have some fresh thoughts about improving your organisation.

Conway’s Law demonstrates that team structures are incredibly important

Conway’s Law is the famous observation that the structure of teams (or more specifically the structure of information flow between groups of people) impacts the design of a system.  For example, if you take a 4-team organisation and ask them to build a compiler, you will most likely get a compiler that has 4 stages to it.  This is important because of the implication that the structure of Teams doesn’t just impact governance, efficiency, and agility, it also impacts the actual architecture of the products that get built.  The architecture of products is vitally important because of the heavy bearing it has on the agility and reliability of not just systems, but the businesses driven by them.

So Conway’s law teaches us two things:

  1. Team designs are very important because good team designs lead to good software design, and good software design leads to better, more effective teams (or, if you get it wrong, the cycle goes the other way).
  2. The factors that influence both a good architecture as well as good teams need to be considered when designing teams, team boundaries, and the planned communication required between teams.

General guidance for great teams

The book doesn’t overlook covering general factors that influence high performing people and teams.  For example:

  • Team sizing – Dunbar’s number is highlighted for its recommendation about the limits of how many people can successfully collaborate in different ways:
    • The strongest form of communication happens in an individual team working closely together with a very consistent shared context and specific purpose.  In software this number is held to be around 8 people.
    • The limit of people who can all deeply trust each other is 15, and there is also significance in the dynamics of 50 and 150 people.  This can provide useful constraints when thinking about the design of teams of teams.
  • The importance of having long-lived teams that are given enough time to reach a state of high performance and then stick together and capitalise on it.  Research shows it can take 3 months for a team to reach the point where it is many times more effective than the sum of its individual members.  The better the overall company culture, the easier changing teams can be, but even in the best cases the research recommends keeping teams together for no less than a year.  The book also highlights that the well-known Tuckman team performance model (Forming, Storming, Norming, Performing) has been shown to be less linear than it sounds.  The Storming stage has been found to restart every time there is a personnel or other major change to the team.
  • Some aspects of office design are important.
  • Putting other team members first and generally investing in relationships and making the team an inclusive place to work is vital.
  • Defining effective team scope, boundaries, and approaches to communication.  This is broadly the topic of the rest of the book (and this blog).

Team Design Consideration: Minimising Team Handoffs, Queues, and Communication Overhead

The book talks about the importance of organising for flow of change to software products.  In order to do that you need to consider team responsibilities and to minimise and optimise communication.  The book presents a case study from Amazon as exponents of “you build it, you run it” approaches.

You also need what the book calls “sensing” which is where an organisation possesses enough feedback mechanisms to ensure software and service quality is understood as early and clearly as possible.

Whilst teams may communicate and prioritise effectively within the boundary of their team, external communication across team boundaries is always much more costly and much less effective.  When teams have demands upon other teams, this can:

  • be disruptive and lead to context switching
  • lead to queues, delays, and prioritisation problems
  • create a large communication and management overhead.

I found the point about communication thought provoking, because it’s often a popular idea that the more collaboration within an organisation, the better.  As the book states, in practice all communication is expensive and should be considered in terms of cost versus benefit.  A friend of mine, Tom Burgess, pointed out a nice similarity to the memory hierarchy in Computer Science.  This also got me thinking about the parallels between people and the Fallacies of Distributed Computing!

Team Design Consideration: Cognitive load

The book introduces a very helpful concept from psychology (created by John Sweller) called Cognitive Load.  This describes how much ‘thinking effort’ a particular ‘thing’ needs in order to be done effectively.  It observes that not all types of thinking effort are the same in nature:

  • Different tasks require different types of thinking
  • There are different causes of the need to think
  • There are different strategies for managing and reducing the effort required for the different types of thinking.

The types are:

  • Intrinsic Cognitive Load – this relates to the skill of how to do a task e.g. which buttons to press according to the specific information available and scenario.
  • Extraneous Cognitive Load – this relates to broader environmental knowledge, not related to the specific skill of the task, but still necessary, e.g. what are the surrounding process admin steps that must be done after this task.
  • Germane Cognitive Load – this relates to thinking about how to make the work as effective as possible e.g. what should the design be, how will things integrate.

The strategies for managing each type of cognitive load are as follows:

  • Intrinsic Cognitive Load – can be reduced by training people or finding people experienced at a task.  The greater their relevant skill levels, the lower the load required to do the task.
  • Extraneous Cognitive Load – can be reduced by automation or good process design and discoverability.
  • Germane Cognitive Load – is really the type of work you want people focusing on.  However, the amount required is a factor of how much scope someone has to worry about.

This is all very interesting and applicable to individuals, but you might be wondering: how does this relate to organisational team design?  The book presents a very useful idea here: cognitive load should be considered in terms of the total amount required by whole teams.

Putting this in plainer terms, when you are thinking about team design and organisational structure, you need to consider how much collectively you are expecting the team to:

  • know
  • be able to perform
  • be able to make effective decisions about
  • be able to make brilliant.

The reason I found this so powerful is because it gives you a logical way to reason with the current fairly fashionable idea that end-to-end / DevOps / BusDevSecOps(!) teams are the utopia (or worse: the meme that if there are separate teams in the value stream you are not doing DevOps and you are doing it wrong).  Sure, giving as much ownership of a product or service as possible to one team avoids team boundaries, but it also increases the cognitive load on the team and potentially the minimum number of people needed in a team.

Decomposing and fracture planes

So a simplified summary so far:

  • Amazon have helped highlight the benefit of minimising handoffs, queues, and communication
  • Sweller has taught us to avoid giving a team too much Cognitive load
  • Dunbar has taught us to keep teams at around 8 people.

At this point we have to consider the amount of scope assigned to a team as our way to satisfy the above constraints.  If we can keep it small enough, perhaps we can still give them end-to-end autonomy whilst keeping team sizes down and cognitive load manageable.

This is where the book starts talking about the concept of Fracture Planes.  The name is a metaphor: when stonemasons break rocks, they focus on the natural characteristics of the structure of the rock in order to break it up cleanly and efficiently.  The theory is that software systems also naturally have characteristics that create more effective places to divide things up.  The metaphor is especially poetic considering large, tightly coupled software systems are often likened to monolithic rocks.

The book provides a useful discussion of different types of fracture plane to explore including:

  • Business domains (i.e. where domain-driven design techniques such as bounded contexts come into play)
  • Change cadence i.e. things that need to change at the same pace
  • Risk i.e. things with similar risk profiles
  • (Most important of all) separation of platform and application (more on that to come).

Ideally all of these can help decompose systems whilst minimising cognitive load and keeping team sizes small enough.

In Part 2 I’ll share the reusable patterns the book proposes for doing this.

SRE Certification – Free Self Test

In early 2017 Martin Fowler wrote a great article called Continuous Integration Certification.  The title was a joke parodying the professional certifications that grant you the title ‘Certified Master of Buzzword’ in exchange for paying for a Buzzword certification course and passing a multiple-choice test.  (Which reminds me, there is a whole site for one hilarious DevOps certification parody.)

Martin wasn’t of course actually selling a certification; he was just writing down self-evaluation criteria to help people assess whether they were actually performing Continuous Integration in the spirit in which the term was invented, or whether they were missing out by having effectively cargo-culted just the easiest-to-copy bits.  The criteria for Martin’s self-evaluation had been created and used by Jez Humble at conferences, where he politely helped large rooms of people realise they still had further opportunity for kata around their CI/CD pipelines.  It was an entertaining post that very effectively highlighted important parts of implementing Continuous Integration.

In this post I wanted to attempt to recreate the same blog but on the topic of Site Reliability Engineering (SRE).  So unfortunately I won’t be offering certificates (but feel free to make and print your own).  Ben Treynor Sloss at Google created the term SRE, and Google have documented it at length, including in two books.  I don’t claim to be an authority on the topic.  But I do have around 18 months of experience of experimenting with SRE practices in different settings, and hence I base my criteria here on what I have found to be most valuable.

  1. Do you have a common language to describe the reliability of your services, expressed in the eyes of your customer and written in business terms?  You might choose to base this around the terminology that Google defined (SLIs, SLOs).  It needs to be able to describe customer interactions with the system and to distinguish whether they are of sufficient quality to be considered reliable or not.  It should be understood consistently by all internal stakeholders across people performing the roles of traditional functional areas (Business, Dev, Ops).
  2. Have you used that common language to create common definitions of what reliability means for your specific systems and have you agreed across all internal roles (see above) what acceptable target levels should be?
  3. Are there people identified as being committed to trying to achieve the agreed ‘acceptable’ target levels of reliability by supporting the system when it is experiencing issues?  They may or may not have the job title Site Reliability Engineer.
  4. Have you implemented capturing these metrics from your live systems so that you can evaluate over different time windows whether the target levels of reliability were achieved?
  5. Are there people that review these metrics as part of supporting the live service and use them to prioritise their efforts around resolving incidents that are causing customers to experience events that do not meet the definition of being reliable?  They could be using Error Budget alerting policies for this, or alternative approaches like reporting on significant numbers of unreliable events.
  6. When one of the metrics is failing to hit the target value over the agreed measurement window, does this situation lead to consequences that address the root causes, improve reliability, and increase resilience?  Are those consequences felt across (i.e. will they be noticed by) all stakeholder functional areas (Business, Dev, Ops), i.e. they definitely aren’t just the job of the affected people performing support roles to resolve?
  7. Are the consequences in the above step pre-agreed, i.e. the breaching of the target doesn’t lead to a prioritisation exercise, or worse, something like a problem record being turned into a story at the bottom of a backlog?  Instead these consequences should happen naturally.
  8. Have you made a commitment to the people supporting a service that manual work, incident resolution, receiving alerts when on call, etc. will only represent part and not all of their job?  The rest of their time will be available for continuous improvement, personal improvement, and other overhead.  You don’t have to call this work Toil (as Google do), and you don’t have to target people spending below 50% of their time on Toil (as Google do).  You just need to set an expectation with some target and commit to hitting it.
  9. Do you have a well understood definition of what Google call Toil, and are you planning to measure and manage the amount of it that teams are performing?
  10. Do you have a mechanism for quantifying the Toil performed and using that information to prioritise the reduction of Toil?
  11. Do you have a mechanism that ensures team members have time to reduce Toil (i.e. pay back operational Technical Debt)?
  12. Do you continuously strive to create a psychologically safe environment and celebrate the learning you get from failures?

If you answered ‘yes’ to all of these I believe you’ve successfully taken on board some of the most valuable parts of SRE.  Print yourself a certificate!  Obviously if you answered “yes” to some and “we’re trying that” with others, then that’s fantastic as well.
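The measurement loop in items 4 to 6 is, at its core, simple arithmetic.  Here is a minimal sketch, with hypothetical event counts and a hypothetical 99.5% target, of how an SLI and the remaining error budget for one measurement window might be computed:

```python
# Minimal sketch of SLI / SLO / error-budget arithmetic for a single
# measurement window.  The event counts and the 99.5% target are
# hypothetical examples, not a real service's figures.

def sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that met the definition of 'reliable'."""
    return good_events / total_events

def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of this window's error budget still unspent (negative if breached)."""
    allowed_bad = total * (1 - slo_target)   # events we can 'afford' to fail
    actual_bad = total - good
    return (allowed_bad - actual_bad) / allowed_bad

# Example: 10,000 requests, 9,970 judged reliable, against a 99.5% SLO.
# The 50-event budget has 30 events spent, so 40% remains.
achieved = sli(9_970, 10_000)
budget = error_budget_remaining(9_970, 10_000, 0.995)
print(f"SLI={achieved:.3%}, error budget remaining={budget:.0%}")
```

A negative result from `error_budget_remaining` is the trigger for the pre-agreed consequences in items 6 and 7.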

You might be surprised to see the omission of things like "you build it you run it", "full stack engineers", "operators coding", even "cloud".  In my opinion these can be parts of the equation that work for some organisations, but they are orthogonal to whether you are practising SRE or not.  If you are surprised not to see things about emergency response processes, data processing pipelines, canary releases, etc., it’s not because I don’t think they are important; I just don’t see them (or the emphasis on them) as unique enough to SRE to be part of my certification.  (Perhaps I should create an advanced certification – cha-ching.)

Hope this is helpful.  Please let me know if you have ideas about the list.

Repeat after me “I am technical”

I’d say roughly once a week I hear work colleagues say the words “I am not technical”.  If you recognise this as something you say, I’m writing this blog to try to convince you to stop.

You are technical

The dictionary definition of technical is as follows:

technical /ˈtɛknɪk(ə)l/ adjective: 1. relating to a particular subject, art, or craft, or its techniques.

Did you spot the bit about coding skills?  No, because it has nothing to do with being technical!

Think about any piece of technology and it’s possible to consider levels of detail:

  1. Why is someone paying for it to exist?
  2. What does it do?
  3. What software is running and how does that work?
  4. What platform is the software running on and how does that work?
  5. What hardware is the software running on and how does that work?

You may be tempted to categorise some of these as functional as opposed to technical.  But as per Wikipedia, a functional requirement is still technical:

“Functional requirements may be calculations, technical details, data manipulation and processing and other specific functionality that define what a system is supposed to accomplish.”

Even if your primary focus is on the higher levels above, and you do feel compelled to draw a line on the list between functional and technical, you almost certainly know a LOT more than the average person in the world about the levels below wherever you place your line.  Your knowledge of those levels may be incomplete, but guess what – so is everyone’s.

We all have levels that we feel most comfortable working at.  I think it is vital to continuously learn about the other levels above and below.  Mentally labelling them as incompatible or off limits has no benefit.  No-one understands every detail all the way down or back up the stack.  At some point everyone has to base their understanding on an acceptance that the things that need to work just work.  Watch physicist Richard Feynman’s video about how (or not actually really how) magnets work for more on this point.

Once you get a degree, you think you know everything.
Once you get a masters, you realise you know nothing.
Once you get a Ph.D., you realise no one knows anything!  (Anon)

Why it can be harmful to say “I’m not technical”

A huge reason not to brand yourself ‘not technical’ is Stereotype Threat.  This is a phenomenon studied in social psychology which shows that people from a group about which a negative stereotype exists may experience anxiety about the stereotype that actually hinders their performance and makes it more likely the stereotype comes true.  Applied here: thinking that you are from a group (for example a role, a part of your company, an academic background, etc.) that is less technical may make it harder for you to become more technical.

Why it’s a waste to say “I’m not technical”

As an employee at your fine company, at any stage of your career, I think you are a role model to others.  You are a face that will be associated with all of the amazing technology and technical achievements we are responsible for delivering.  You are a face that will represent in people’s minds what people who work in ‘Tech’ look like.  This is an amazing opportunity and privilege, and something to be proud of.  I don’t think you should diminish it by claiming your level of contribution is not technical.  That completely wastes the opportunity for you to demonstrate what ‘Tech’ really is and make it appealing to others.

We all have a role to play in supporting each other in becoming more technical, and one of the simplest steps we can take is to be mindful of the words we use.  Many people (for example Dave Snowden here) have observed the positive and negative impacts of language in uniting an in-group and excluding an out-group.  We must play our part in minimising jargon and its unnecessary negative effects.  Just remember, when you ask someone what a particular piece of ‘technical’ terminology means, don’t expect them to necessarily have a complete understanding either.

Repeat after me:  “I am technical”.