The Importance of Developer Experience in the PaaS Age

Platforms make or break software applications (literally) and they can make or break whole organisations.  Platform teams can also be a highly successful team topology as I’ve covered here.

At this time of rapid advancement in public cloud platform services, anyone with a credit card has ready access to powerful capabilities.  This erodes the advantage that established organisations have over new market entrants.  Every day, platform capabilities (like container scheduling, caches, GPU farms, etc.) become more commoditised, and every day, integrating these technologies and delivering them internally as effective platform services becomes more important.

I believe large enterprises need to pivot their existing platforms to meet modern standards and this doesn’t just mean technical capability, but also usability.  The term Developer Experience (DX) can be defined as:

the equivalent of User Experience when the primary user of the product is a developer. DX cares about the developer experience of using a product, its libs, SDKs, documentation, frameworks, open-source solutions, general tools, APIs, etc.

It is nearly 10 years old as a concept with its origins in tech companies looking to get third party developers to use their APIs.  Lots of things impact DX as described excellently here but it is the factors related to platforms (and what to do about them) that I want to focus on.

Key Factors Impacting DX when using a platform

Whilst developer experience is also a function of things like the platform architecture, the API, and documentation, the following factors are what I believe most influence developer experience around a platform.

Minimise Extraneous Cognitive Load  

Psychologist John Sweller created the concept of Cognitive Load, which characterises how much ‘thinking effort’ a particular ‘thing’ needs in order to be done effectively.  A good platform will enable developers to devote their brain energy to germane cognitive load, which is essentially the creative part of performing their work.  It will minimise extraneous cognitive load, which means effort spent thinking about things like how to get code compiled or deployed.  A platform with challenges in this area may demand high extraneous cognitive load, leading developers to waste effort thinking about how to use the platform rather than what their code should be doing on it.

Maximise The Opportunity for Flow

Another important concept relevant to developer experience is Flow, as defined by psychologist Mihaly Csikszentmihalyi.  In a Flow state, people are completely absorbed in a task and are hopefully both at their most productive and their happiest.  A good platform should enable a workflow that lets developers focus without being interrupted by their tools.  This means the time they spend waiting for computer tasks to complete, like committing code, compiling, or getting feedback from tests, needs to be absolutely minimised.  A platform with challenges may hold developers up so badly that it is impossible for them to work in a Flow-like state.

Quality of Platform Service

The performance and consistency of the platform, including tools and test environments, is of vital importance to developer effectiveness.  People very quickly become accustomed to a poor user experience; at that point learned helplessness kicks in, they suffer in silence, and it becomes harder to understand how the platform may be negatively affecting their productivity and engagement.

Improving the platform’s impact on DX

Here are my thoughts on how to drive improvements.

Housing the challenge of improving DX

The challenge of creating effective DX involves factors beyond the scope of any one team.  These include the platform architecture and API, the skills of the engineering team, other common platform services, etc.  Naming a team after DX could be taken to imply that the team is solely responsible for it.  DX should be promoted as something achieved by multiple teams working both individually and together.  That said, it can be effective to create a team that owns these complementary but distinct roles:

  1. Providing some tools as services to developers and as part of the overall platform API.
  2. Helping teams improve their delivery effectiveness through:
    • studying the DX of engineering teams, 
    • involving the right parties, 
    • driving experiments to learn how to improve it.

When focusing on this, it can be helpful to seek input from general (as in not just platform) UX specialists, and also (if you have them) any public API teams and DevRel people.

Studying Developer Experience

A collaborative approach to improving DX can benefit from:

  • Cross team terminology for describing it.
  • Common DX metrics for measuring it.
  • Joint initiatives for learning how to improve it.

This is of course reliant on a psychologically safe culture where teams are comfortable sharing their challenges and failures with each other.

It would be valuable to understand more about the current state of DX at an organisation.  This could include:

  • Tracking and sharing internally a baseline of the above-mentioned DX metrics.
  • Working closely with a rotating VIP engineering team.  (Has the side benefit that they can become advocates.)
  • Analysing the Request, Incident, and Problem Management tickets created by developers and related to the platform.
  • Reviewing metrics around the reliability of the tools (using Service Level Indicators from SRE).
  • Creating personas for engineering team members (considering things like the technologies they use, their proficiency, their team release cadences).
  • Gathering more user analytics directly from the tools e.g. compile times, invalid commands attempted by users.
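To make the reliability and analytics points concrete, here is a minimal sketch (the tool names and numbers are invented for illustration) of computing an SRE-style Service Level Indicator for a platform tool from a list of request outcomes:

```python
# Hypothetical example: an availability SLI for a platform tool (e.g. the CI
# service), computed as the fraction of requests that succeeded.

def availability_sli(outcomes):
    """Return the fraction of successful requests (1.0 if there was no traffic)."""
    if not outcomes:
        return 1.0
    good = sum(1 for ok in outcomes if ok)
    return good / len(outcomes)

# One day of CI pipeline triggers: True means the service responded successfully.
day = [True] * 980 + [False] * 20
print(f"Availability SLI: {availability_sli(day):.2%}")  # prints "Availability SLI: 98.00%"
```

The same shape of calculation works for other SLIs, such as the fraction of compiles finishing under a latency threshold, which is why SLIs pair naturally with the tool analytics mentioned above.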

Experimenting and Learning

With the above foundation of an improved collective understanding of DX, we can try to make changes to drive improvements.  Hopefully the insights will lead to some quick wins that can be implemented immediately.

An approach to more involved change could include:

  • Creating a value framework to help create a consensus about how we can measure and demonstrate improvement. 
  • Correlating our values to the DX metrics.
  • Ensuring we can baseline performance against the metrics.
  • Creating a hypothesis about how an improvement may improve one of the metrics we care about and hence deliver value.
  • Refining the hypothesis to create the lowest cost experiment that shows the fastest results.
  • Prioritising hypotheses using expected value divided by estimated effort.
  • Performing experiments and iteratively driving improvements.
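As a concrete illustration of the prioritisation step, here is a minimal sketch of ranking hypotheses by expected value divided by estimated effort; the hypotheses and scores are invented for illustration:

```python
# Hypothetical improvement hypotheses with estimated value and effort scores.
hypotheses = [
    {"name": "cache build dependencies", "expected_value": 8, "effort": 2},
    {"name": "parallelise test suite", "expected_value": 9, "effort": 5},
    {"name": "self-service environments", "expected_value": 6, "effort": 3},
]

# Score each hypothesis by value per unit of effort, then rank: the cheapest
# experiments with the highest expected value rise to the top of the backlog.
for h in hypotheses:
    h["score"] = h["expected_value"] / h["effort"]

ranked = sorted(hypotheses, key=lambda h: h["score"], reverse=True)
for h in ranked:
    print(f'{h["name"]}: {h["score"]:.1f}')
```

The absolute numbers matter much less than the conversation they force about why we believe an experiment will deliver value.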


I hope this has got you thinking about the platform parts of DX.  I would love to hear your own ideas about how to improve them.

Team Topologies Book Summary – Part 3 of 3: Taking Action

In part 1 of this 3 part blog series about the Team Topologies book, I summarised a large set of general things that can help make teams successful.  In part 2 I covered the book’s ideas on team design and team interaction modes.  What an epic so far, and this has just been my summary of parts of the book, with much less detail and specific advice than the original.  In this final blog of the series I will leave you with some things to consider doing with all of this excellent information.

From the general guidance for effective teams, consider:

  • Are teams too large?
  • Do they have the right working environment?
  • Do they treat other team members well enough?
  • Are teams being allowed to stay together long enough?

Using the concept of Cognitive load, consider:

  • Do teams have the right skills to handle the intrinsic cognitive load expected of them?
  • Do teams use enough automation to minimise extraneous cognitive load?
  • Do teams have effective interaction modes with other teams to minimise extraneous cognitive load?
  • Do teams have a manageable amount of scope (germane cognitive load), or should they consider exploiting a fracture plane to divide up?
  • Could SRE processes be adopted to keep cognitive load down by continuing to divide it between a Dev and an Ops (SRE) team, but also to improve overall alignment, resilience, and agility?

The terminology for team types and interaction types in the book is extremely helpful when thinking about the teams in your organisation.

Using the topologies, consider:

  • Which types do our existing teams align to?
  • If our teams do align to a type, are they following the recipes for success in the book or falling into the anti-patterns?
  • If our teams do not align to a type in the book, should they be altered so that they do?  Or do we really want to customise the types for our needs?
  • Are our teams effectively using the interaction types recommended in the book, or if not, are they communicating too much, and generating too much cognitive load?

If existing team structures do not fit the advice of this book and you see room for improvement, what factors have influenced them to become as they are today (e.g. budgeting / finance models)?  Are these things easily surmountable?

I hope this has all been useful and you are now inspired to read the full book, to get much deeper into this vital topic, and to start applying it to your own organisations!

I think the authors Matthew and Manuel did an awesome job and would like to thank them again for writing it.

Team Topologies Book Summary – Part 2 of 3: Topologies and Interaction Modes

In part 1 of this 3 part blog series about the Team Topologies book, I summarised a large set of general things that can help make teams successful.  In this part I will get to actual team design, i.e. the types of team and how they should interact.

The Four Team Types

Based on their extensive research, the authors present 4 types of team and make an important assertion: to be an effective organisation you only need a mixture of teams conforming to these 4 types.

Personally, I really identified with the 4 types, but even if you have different opinions, it is still very useful to be able to reference and extend / update the common set of types described (in such great detail) in the book.

The 4 types are as follows.

Stream Aligned Teams

This is the name the authors coined for an end-to-end team that owns the full value stream for delivering changes to software products or services AND operating them.  This is the stereotypical modern and popular “you build it, you run it” team.

For Stream Aligned teams to be able to do their job, they need to be comprised of team members that collectively have all of the different skills they need e.g. business knowledge, architecture, development and testing skills, UX, security and compliance knowledge, monitoring and metrics skills etc.  This enables them to play nicely with the requirement of minimising queues and handoffs to other teams (especially in comparison to teams comprised of people performing singular functions e.g. testing).

The expectation is that by teams owning their whole value stream including the performance of the system in production, they can optimise for rapid small batch size changes, and reap all of the expected benefits around both agility and safety.  The hope is also that this end to end scope might help teams achieve “autonomy, mastery and purpose” (things which Daniel Pink highlights as most important to knowledge workers).

The amount of cognitive load required of them is a function of a few things:

  • If they have enough of the requisite skills in their team, intrinsic cognitive load will be manageable.
  • If the modes of interaction with other teams are clean and efficient, their extraneous cognitive load shouldn’t be too high.  They can also minimise this within their team through automation.
  • If they were formed using fracture planes effectively, they can keep their scope (and therefore their germane cognitive load) to a manageable amount.

Overall this seems like a very compelling team type and indeed the book recommends that the majority of teams in organisations have this form.

The book goes into a lot more detail about how to make a Stream Aligned Team effective which I highly recommend reading.

If the book only included Stream Aligned Teams, I would have considered it a cynical attempt to document fashionable ideas and ignore the factors that have led many people to have other team types.  Fortunately, instead they did a great job at considering the whole picture via 3 other types (plus bonus type SRE – read on!)

Platform Teams

In a tech start-up that begins with just one Stream Aligned Team owning the full stack and software lifecycle end-to-end, at some point they will run into Dunbar’s Number and cognitive overload.  It will be time to look for a fracture plane to enable splitting into at least two teams.  Separating the platform from the application is a very successful and well-established pattern.  At this point we could potentially have one Stream Aligned Team working on the application and one on the platform.

Let’s say the business and demand for application complexity continue to increase, and the application team splits into two teams focussed on different application value streams.  We’re now in a situation where both teams most likely re-use the same platform, and we get the benefit of that re-use.  This could of course repeat many times.  The authors recognised that the nature of running a platform team (especially in terms of the type of coupling to other teams and the effective modes of communication) differed enough to warrant a new team type, which they called the Platform Team.

Platform Teams create and operate something re-usable for meeting one or more application hosting requirements.  These could be running platform applications like Kubernetes, a Content Management System as a service, IaaS, or even teams wrapping third party as-a-service services like a Relational Database Service.  They may also be layering on additional features valuable to the consumers of the platform such as security and compliance scanning.

But there are potential traps with platform teams.  If the platform is too opinionated and, worse, its usage is also mandated within an organisation, the impedance mismatch may do more harm than good for consuming applications.  If a platform isn’t observable and applications are deployed into a black box, the application teams will be disconnected from the detail of how their application is performing in production and disempowered to make a difference to it.  The book goes into a lot of detail about how to avoid creating bad platform teams and makes a strong argument for keeping platforms as thin as possible (which it calls a minimum viable platform).

Personally, I think the dynamic that occurs when consuming a platform from a third party is very powerful for a number of reasons:

  • The platform provider is under commercial pressure to create a platform good enough that consumers pay to use it.
  • The platform provider has to be able to deliver the service at a cost point below the total revenue its users will give it (so it will be efficient and choose its offered services carefully).

If organisations can at least try to think like that (with or without internal use of “wooden dollars”) I think they will create an effective dynamic.  The book mentions that Don Reinertsen recommends internal pricing as a mechanism for avoiding platform consumers demanding things they don’t need.

Enabling Teams

The next type of team is designed to serve other teams to help them improve.  I think it is great that the book acknowledges and explores these as they are very common and often a good thing.  Enabling teams should contain experienced people excited about finding ways to help other teams improve.  To some extent they can be thought of as technical consultants with a remit for helping drive improvement.

Most important is how well Enabling teams engage with other teams.  There are various traps such a team can fall into and must avoid:

  • Becoming an ivory tower defining processes, policies, perhaps even technical ‘decisions’ and inflicting them upon the teams that they are supposed to be helping.
  • Generally being disruptive by causing things like interruptions, context switching, cognitive load, and communication overhead – especially if this cost outweighs the benefit.

As with all the other types, the book provides detailed expected behaviours and success criteria for Enabling teams, such as understanding the needs of the teams they support and taking an experimental approach to meeting them.  It’s also important that Enabling teams bring in a wider perspective, for example ideas about technology advances and industry technology and process trends.  Enabling teams may be long lived, but their engagement with the teams they are supporting should probably not be permanent.

Complicated-Subsystem Teams

Finally, the book defines a 4th type of team called a complicated-subsystem team.  Essentially these teams own the development and maintenance of a complicated subcomponent that is probably consumed as code or a binary, rather than at runtime as a service over a network.  The concept is that the component requires specialist knowledge to build and change and that can be done most effectively by a dedicated team without the cognitive load required to consume and deploy their component.

Other types of teams: SRE teams

I did feel the book acknowledged some other team types outside of the main 4, for example, SRE teams.

Without getting into too much detail, Site Reliability Engineering (SRE) is an approach to Operations that Google developed and started sharing with the world a couple of years ago.  The thing that is most interesting about it from a team design perspective is that an SRE team is, in essence, a traditional separate operations team.  In this regard it doesn’t really fit the 4 topologies above.  I think there are a couple of reasons the SRE model is successful:

  1. At least at Google, the default mode of operation is Stream Aligned teams.  An SRE team will only operate an application if it is demonstrably reliable and doesn’t require a lot of manual effort to operate.
  2. It promotes use of some conceptually simple but effective metrics to ensure the application meets the standards for an SRE to operate it.

The book says that if your systems are reliable and operable enough, you can use SRE teams.  Doing so will of course reduce some cognitive load on the Stream Aligned team, which then has to pay less attention to the demands of an application in production.  This is actually very interesting, because in some ways a Stream Aligned team handing over applications to an SRE team to operate is very similar to a traditional split between Dev and Ops teams.
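As a sketch of the kind of conceptually simple metric mentioned above (my illustration, not from the book), an error budget derived from a Service Level Objective can be used to decide whether a service is reliable enough for an SRE team to operate:

```python
# Hypothetical error budget calculation: how much of the failure allowance
# implied by an SLO has a service left over a given window?

def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget unspent; negative means the budget is blown.

    slo: target success rate, e.g. 0.999 for 'three nines'.
    """
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - (failed_requests / allowed_failures)

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures; 250 failures
# means three quarters of the budget is still available.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

A team could, for example, only hand a service to SREs while the budget stays positive, which matches the gating behaviour described in point 1 above.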

So, the message in a way becomes: if your processes are mature enough and your engineering good enough, traditional ways will suffice, BUT the problem is they probably aren’t, and instead you need to use Stream Aligned teams until you get there.  I think this leads to an alternative option for organisations: directly implement the measurement practices of SRE and focus on quality until you can make your existing separate Dev and Ops teams perform well enough.  I would even suggest this justifies a first-class fifth team type.

Other types of teams: Service Experience

The book also acknowledges that many business models include call centre teams and service desk teams who engage with customers directly.  It proposes keeping these teams separate from the team types above, whilst ensuring information about the operation of the live system makes it back to the Stream Aligned teams.  It also talks about grouping and tightly aligning Service Experience Teams with Stream Aligned teams.  Service Experience Teams are named as such to emphasise their customer orientation.

A key success criterion is the relationship to Stream Aligned teams, which must be close and ideally 1 to 1.  I think this is an excellent point: in many organisations, the 1 to many (overly shared) relationships between teams that should be closely aligned cause as many problems as the team boundaries themselves.  For example, if an Application Support team is spread across too many applications, whilst it may achieve some efficiencies of scale, the overall service delivered to each supported team will be far less effective than if the team were subdivided into smaller, more closely aligned teams.

If you spotted other team types in the book that were acknowledged as acceptable – let me know.

The Three Modes of Team Interaction

The book creates 3 interaction modes and recommends which team types should use each one.  By interactions the book means how and when teams communicate and collaborate with each other.  Central to this is the idea that communication is expensive and erodes team structures, so should be used wisely.

The first interaction mode the book defines is called Collaboration.  This is a multi-directional, regular, and close interaction.  It is fast, if not real time, and therefore responsive and agile.  It is the mode that enables teams to work most closely together and is useful when there is a high degree of uncertainty or change and co-evolution is needed.

The cost of this mode is higher communication overhead and increased cognitive load – because teams will need to know more about the other teams.

Stream Aligned Teams will be the most likely to use this, probably with other Stream Aligned teams and especially earlier in the life of a particular product when uncertainty is highest, and boundaries are evolving.

The second mode is called X-as-a-Service.  A team wishing to use this interaction mode needs to consciously design and optimise it so that it serves both them and the other parties as effectively as possible.

Making this mode effective entails abstracting unnecessary detail from other teams and making the interface discoverable, self-documented and possibly an API.

The downside of this mode is that if it isn’t implemented effectively, it may reduce the flow of other teams.  It also needs to stay relatively static in order to avoid consuming teams having to constantly relearn how to integrate.

As you’ve probably guessed this is a great model for a Platform Team to adopt – especially if they are highly re-used by many consumers.  It can be especially effective when platforms are well established and do not require rapid change or co-evolution of the API.

The final mode is called Facilitating.  This is the best mode for Enabling teams and describes how they can be helpful without being overly demanding.

All of the above modes are described in much more detail in the book, with very actionable advice.

Continuous Evolution

The book has a section on static topologies i.e. how teams may interact at a point in time.  However, it stresses the importance of sensing and adapting.  As the maturity of a team changes collaboration modes and even team types may change or subdivide.

The Team Topologies book uses SRE as a good example of teams moving from Stream Aligned, to a function containing reliability specialists (who may or may not call themselves SREs) operating as an Enabling team, to an SRE team acting as a separate Ops team, and then back.  Obviously, there is a tension between making these changes to keep teams effective and the costs of changing teams and resetting back to Storming (as per Tuckman).

The book even offers advice for helping existing functional or traditional teams transition into the new model for example: infrastructure teams to Platform teams, support teams to Stream Aligned teams.

In the final part of this series, I will share some thoughts about how to use this information.

Team Topologies Book Summary – Part 1 of 3: Key Concepts

Team Topologies is one of the latest books published by IT Revolution (the excellent company created by Gene Kim the co-author of The Phoenix Project and author of The Unicorn Project).  Team Topologies was written by Matthew Skelton and Manuel Pais (the people behind the DevOps Topologies website) but it is far more than an ‘extended dance remix’ of that.  The book takes a much wider view on team designs that make companies successful and considers the full socio-technical problem space.

This blog series is a set of highlights from the book followed by some suggestions of what to do with the information.  I hope it will also inspire you to read the book and have some fresh thoughts about improving your organisation.

Conway’s Law Demonstrates That Team Structures Are Incredibly Important

Conway’s Law is the famous observation that the structure of teams (or more specifically the structure of information flow between groups of people) impacts the design of a system.  For example, if you take a 4-team organisation and ask them to build a compiler, you will most likely get a compiler that has 4 stages to it.  This is important because of the implication that the structure of Teams doesn’t just impact governance, efficiency, and agility, it also impacts the actual architecture of the products that get built.  The architecture of products is vitally important because of the heavy bearing it has on the agility and reliability of not just systems, but the businesses driven by them.

So Conway’s law teaches us two things:

  1. Team designs are very important because good team designs lead to good software design, and good software design leads to better, more effective teams (or if you get it wrong, the cycle goes the other way).
  2. The factors that influence both a good architecture as well as good teams need to be considered when designing teams, team boundaries, and the planned communication required between teams.

General guidance for great teams

The book doesn’t overlook covering general factors that influence high performing people and teams.  For example:

  • Team sizing – Dunbar’s number is highlighted for its recommendations about the limits of how many people can successfully collaborate in different ways:
    • The strongest form of communication happens in an individual team working closely together with a very consistent shared context and specific purpose.  In software this number is supposed to be 8 people.
    • The limit of people who can all deeply trust each other is 15, and there is also significance in the dynamics of 50 and 150 people.  This can provide useful constraints when thinking about the design of teams of teams.
  • The importance of having long-lived teams that are given enough time to get to a state of high performance and then stick together to capitalise on it.  Research shows teams can take 3 months to get to the point where the team is many times more effective than the sum of its individual members.  The better the overall company culture, the easier changing teams can be, but even in the best cases the research recommends keeping teams together for no less than a year.  The book also highlights that the well-known Tuckman team performance model (Forming, Storming, Norming, Performing) has been proven to be less linear than it sounds.  The Storming stage has been found to restart every time there is a personnel or other major change to the team.
  • Some aspects of office design are important.
  • Putting other team members first and generally investing in relationships and making the team an inclusive place to work is vital.
  • Defining effective team scope, boundaries, and approaches to communication.  This is broadly the topic of the rest of the book (and this blog).

Team Design Consideration: Minimising Team Handoffs, Queues, and Communication Overhead

The book talks about the importance of organising for flow of change to software products.  In order to do that you need to consider team responsibilities and to minimise and optimise communication.  The book presents a case study from Amazon as exponents of “you build it, you run it” approaches.

You also need what the book calls “sensing” which is where an organisation possesses enough feedback mechanisms to ensure software and service quality is understood as early and clearly as possible.

Whilst teams may communicate and prioritise effectively within the boundary of their team, external communication across team boundaries is always much more costly and much less effective.  When teams have demands upon other teams, this can:

  • be disruptive and lead to context switching
  • lead to queues, delays, and prioritisation problems
  • create a large communication and management overhead.

I found the point about communication thought provoking because it’s often a popular idea that the more collaboration within an organisation, the better.  As the book states, in practice all communication is expensive and should be considered in terms of cost versus benefit.  A friend of mine Tom Burgess pointed out a nice similarity to memory hierarchy in Computer Science.  This also got me thinking about the parallels of people and the Fallacies of Distributed computing!

Team Design Consideration: Cognitive load

The book introduces a very helpful concept from psychology (created by John Sweller) called Cognitive Load.  This describes how much ‘thinking effort’ a particular ‘thing’ needs in order to be done effectively.  It observes that not all types of thinking effort are the same in nature:

  • Different tasks require different types of thinking
  • There are different causes of the need to think
  • There are different strategies for managing and reducing the effort required for the different types of thinking.

The types are:

  • Intrinsic Cognitive Load – this relates to the skill of how to do a task e.g. which buttons to press according to the specific information available and scenario.
  • Extraneous Cognitive Load – this relates to broader environmental knowledge, not related to the specific skill of the task, but still necessary, e.g. what are the surrounding process admin steps that must be done after this task
  • Germane Cognitive Load – this relates to thinking about how to make the work as effective as possible e.g. what should the design be, how will things integrate.

The strategies for managing each type of cognitive load are as follows:

  • Intrinsic Cognitive Load – can be reduced by training people or finding people experienced at a task.  The greater their relevant skill levels, the lower the load required to do the task.
  • Extraneous Cognitive Load – can be reduced by automation or good process design and discoverability.
  • Germane Cognitive Load – is really the type of work you want people focusing on.  However, the amount required is a function of how much scope someone has to worry about.

This is all very interesting and applicable to individuals, but you might be wondering: how does this relate to organisational team design?  The book presents a very useful idea here: cognitive load should be considered in terms of the total amount required by whole teams.

Putting this in plainer terms, when you are thinking about team design and organisational structure, you need to consider how much collectively you are expecting the team to:

  • know
  • be able to perform
  • be able to make effective decisions about
  • be able to make brilliant.

The reason I found this so powerful is because it gives you a logical way to reason with the current fairly fashionable idea that end-to-end / DevOps / BusDevSecOps(!) teams are the utopia (or worse: the meme that if there are separate teams in the value stream you are not doing DevOps and you are doing it wrong).  Sure, giving as much ownership of a product or service as possible to one team avoids team boundaries, but it also increases the cognitive load on the team and potentially the minimum number of people needed in a team.

Decomposing and fracture planes

So a simplified summary so far:

  • Amazon have helped highlight the benefit of minimising handoffs, queues, and communication
  • Sweller has taught us to avoid giving a team too much Cognitive load
  • Dunbar has taught us to keep teams at around 8 people.

At this point we have to consider the amount of scope assigned to a team as our way to satisfy the above constraints.  If we can keep it small enough, perhaps we can still give them end-to-end autonomy whilst keeping team sizes down and cognitive load manageable.

This is where the book starts talking about the concept of Fracture Planes.  The name is a metaphor for how, when stonemasons break rocks, they focus on the natural characteristics of the structure of the rock in order to break it up cleanly and efficiently.  The theory is that software systems also naturally have characteristics that create more effective places to divide things up.  The metaphor is especially poetic considering large tightly coupled software systems are often likened to monolithic rocks.

The book provides a useful discussion of different types of fracture plane to explore including:

  • Business domains (i.e. where the whole domain-driven design and bounded contexts design techniques come into play)
  • Change cadence i.e. things that need to change at the same pace
  • Risk i.e. things with similar risk profiles
  • (Most important of all) separation of platform and application (more on that to come).

Ideally all of these can help decompose systems whilst minimising cognitive load and keeping team sizes small enough.

In Part 2 I’ll share the reusable patterns the book proposes for doing this.

SRE Certification – Free Self Test

In early 2017 Martin Fowler wrote a great article called Continuous Integration Certification.  The title referred to a joke he was making by parodying some professional certifications available that give you the title ‘Certified Master of Buzzword’ by paying for a Buzzword certification course and passing a multiple choice test.  (Which reminds me there is a whole site for one hilarious DevOps certification parody).

Martin wasn’t of course actually selling a certification, he was just writing down self-evaluation criteria to help people assess whether they were actually performing Continuous Integration in the spirit that the term was invented, or whether they were missing out by having effectively cargo-culted just the easiest-to-copy bits.  The criteria for Martin’s self-evaluation had been created and used by Jez Humble at conferences, where he politely helped large rooms of people realise they still had further opportunity for kata around their CI/CD pipelines.  It was an entertaining post that very effectively highlighted important parts of implementing Continuous Integration.

In this post I wanted to attempt to recreate the same blog but on the topic of Site Reliability Engineering (SRE).  So unfortunately I won’t be offering certificates (but feel free to make and print your own).  Ben Treynor Sloss at Google created the term SRE and Google have documented it at length, including two books.  I don’t claim to be an authority on the topic.  But I do have around 18 months of experience at experimenting with SRE practices in different settings, and hence I base my criteria here on what I have found to be most valuable.

  1. Do you have a common language to describe reliability of your services expressed in the eyes of your customer and written in business terms?  You might choose to base this around the terminology that Google defined (SLIs, SLOs).  It needs to be able to describe customer interactions with the system and be able to distinguish whether they are of sufficient quality to be considered reliable or not.  It should be understood consistently by all internal stakeholders across people performing the roles of traditional functional areas (Business, Dev, Ops).
  2. Have you used that common language to create common definitions of what reliability means for your specific systems and have you agreed across all internal roles (see above) what acceptable target levels should be?
  3. Are there people identified as being committed to trying to achieve the agreed ‘acceptable’ target levels of reliability by supporting the system when it is experiencing issues?  They may or may not have the job title Site Reliability Engineer.
  4. Have you implemented capturing these metrics from your live systems so that you can evaluate over different time windows whether the target levels of reliability were achieved?
  5. Are there people who review these metrics as part of supporting the live service and use them to prioritise their efforts around resolving incidents that are causing customers to experience events that do not meet the definition of being reliable?  They could be using Error Budget alerting policies for this, or alternative approaches like reporting on significant numbers of unreliable events.
  6. When one of the metrics is failing to hit the target value over the agreed measurement window, does this situation lead to consequences happening to address the root causes, improve reliability, and increase resilience?  Are those consequences felt across (i.e. will be noticed by) all stakeholder functional areas (Business, Dev, Ops), i.e. they definitely aren’t just the job of the affected people performing support roles to resolve?
  7. Are the consequences in the above step pre-agreed, i.e. the breaching of the target doesn’t lead to a prioritisation exercise or, worse, something like a problem record being turned into a story at the bottom of a backlog?  Instead these consequences should happen naturally.
  8. Have you made a commitment to the people supporting a service that manual work, incident resolution, receiving alerts when on call, etc. will only represent part, and not all, of their job?  The rest of their time will be available for continuous improvement, personal improvement, and other overhead.  You don’t have to call this work Toil (as Google do), and you don’t have to target keeping time spent on Toil below 50% (as Google do).  You just need to set an expectation with some target and commit to hitting it.
  9. Do you have a well understood definition of what Google call Toil, and are you planning to measure and manage the amount of it that teams are performing?
  10. Do you have a mechanism for quantifying Toil performed and using that information to prioritise reduction of Toil?
  11. Do you have a mechanism that ensures team members have time to reduce Toil (i.e. pay back operational Technical Debt)?
  12. Do you continuously strive to create a psychologically safe environment and celebrate the learning you get from failures?
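Several of the checks above (4 to 6) come down to a simple calculation: measure the fraction of customer interactions that were reliable, compare it with the agreed target, and see how much error budget remains.  Here is a minimal sketch of that arithmetic in Python; the function name, event counts, and 99.9% target are all illustrative assumptions, not taken from any particular monitoring tool.

```python
# Hypothetical sketch: evaluating an availability SLO and its error budget
# over a measurement window.  All names and numbers are illustrative.

def error_budget_report(good_events: int, total_events: int,
                        slo_target: float) -> dict:
    """Compare measured reliability against an agreed SLO target.

    slo_target is the agreed fraction of events that must be 'good',
    e.g. 0.999 for a 99.9% target.
    """
    sli = good_events / total_events   # measured reliability (the SLI)
    budget = 1.0 - slo_target          # failure fraction we agreed to tolerate
    burned = 1.0 - sli                 # failure fraction actually observed
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "budget_remaining": budget - burned,  # negative means budget blown
    }

# Example window: 1,000,000 customer interactions, 999,500 of them reliable.
report = error_budget_report(good_events=999_500, total_events=1_000_000,
                             slo_target=0.999)
```

The point of the sketch is that once the common language in items 1 and 2 exists, the evaluation in items 4 to 6 is mechanical, which is what makes pre-agreed consequences (item 7) possible.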

If you answered ‘yes’ to all of these I believe you’ve successfully taken on board some of the most valuable parts of SRE.  Print yourself a certificate!  Obviously if you answered “yes” to some and “we’re trying that” with others, then that’s fantastic as well.

You might be surprised to see the omission of things like “you build it you run it”, “full stack engineers”, “operators coding”, even “cloud”.  In my opinion these can be parts of the equation that work for some organisations, but they are orthogonal to whether you are practising SRE or not.  If you are surprised not to see things about emergency response processes, data processing pipelines, canary releases etc., it’s not because I don’t think they are important, I just don’t see them (or the emphasis on them) as unique enough to SRE to be part of my certification.  (Perhaps I should create an advanced certification – cha-ching.)

Hope this is helpful.  Please let me know if you have ideas about the list.

Repeat after me “I am technical”

I’d say roughly once a week I hear work colleagues say the words “I am not technical”.  If you recognise this as something you say, I’m writing this blog to try to convince you to stop.

You are technical

The dictionary definition of technical is as follows:

technical /ˈtɛknɪk(ə)l/ adjective: 1. relating to a particular subject, art, or craft, or its techniques.

Did you spot the bit about coding skills?  No, because it has nothing to do with being technical!

Think about any piece of technology and it’s possible to consider levels of detail:

  1. Why is someone paying for it to exist?
  2. What does it do?
  3. What software is running and how does that work?
  4. What platform is the software running on and how does that work?
  5. What hardware is the software running on and how does that work?

You may be tempted to categorise some of these as functional as opposed to technical.  But as per Wikipedia, a functional requirement is still technical:

“Functional requirements may be calculations, technical details, data manipulation and processing and other specific functionality that define what a system is supposed to accomplish.”

Even if your primary focus is in the higher levels above, and you do feel compelled to draw a line on the list between functional and technical, you almost certainly know a LOT more than the average person in the world about the levels below wherever you place your line.  Your knowledge of those levels may be incomplete, but guess what – so is everyone’s.

We all have levels that we feel most comfortable working at.  I think it is vital to continuously learn about the other levels above and below.  Mentally labelling them as incompatible / off limits has no benefit.  No-one understands every detail all the way down or back up the stack.  At some point everyone has to base their understanding on an acceptance that the things that need to work just work.  Watch physicist Richard Feynman’s video about how (or not actually really how) magnets work for more on this point.

Once you get a degree you think you know everything.
Once you get a masters, you realise you know nothing.
Once you get a Ph.D., you realise 
no one knows anything!  (Anon)

Why it can be harmful to say “I’m not technical”

A huge reason not to brand yourself ‘not technical’ is Stereotype threat.  This is a phenomenon studied by social psychology which shows that people from a group about which a negative stereotype exists may experience anxiety about the stereotype that actually hinders their performance and makes the stereotype more likely to come true.  So applied here, thinking that you are from a group (for example role, part of your company, academic background, etc.) that is less technical may make it harder for you to get more technical.

Why it’s a waste to say “I’m not technical”

As an employee at your fine company, at any stage of your career, I think you are a role model to others.  You are a face that will be associated with all of the amazing technology and technical achievements we are responsible for delivering.  You are a face that will represent in people’s minds what people who work in ‘Tech’ look like.  This is an amazing opportunity and privilege and something to be proud of.  I don’t think you should diminish it by claiming your level of contribution is not technical.  It completely wastes the opportunity for you to demonstrate what ‘Tech’ really is and make it appealing to others.

We all have a role to play in supporting each other at becoming more technical, and one of the simplest steps we can take is to be mindful of the words we use.  Many people (for example Dave Snowden here) have observed the positive and negative impacts of language in terms of uniting an in-group and excluding an out-group.  We must play our part in minimising jargon and its unnecessary negative effects.  Just remember, when you ask someone what a particular piece of ‘technical’ terminology means, don’t expect them to necessarily have a complete understanding either.

Repeat after me:  “I am technical”.

Buzzwords hack – reading around the hype

Buzzwords (aka buzz terms) are useful.  They are shortcuts to more complex and emerging concepts.  They are tools for grouping aligned ideas and aspirations.  They can be valuable for connecting people with common ambitions to improve something (e.g. you and your customers / suppliers).

But buzzwords become a victim of their own success. The more popular a word becomes, the greater the diversity of definitions.  The result is that literature around a buzzword’s topic can quickly lose its intended meaning and impact.  We’ve all read countless blogs, articles, strategy documents, job descriptions etc., that suffer in this way.  The pitfall lies in the assumption by the author that the buzzword has a commonly and consistently understood meaning.

Here is a very simple suggestion for navigating buzzwords in documents.  I’ve used it quite a bit and found it surprisingly effective.

  1. Find and replace every instance of the buzzword (e.g. SRE!) with _BUZZWORD_.
  2. Read the document and every time you get to the _BUZZWORD_ text, stop and decide what you think the author actually intended the word to mean.
  3. Start a list of the definitions you think the author intended in each sentence.  It’s important to pay attention to the context in which the word is used, to avoid just focusing on your own pre-conceptions about its meaning.  Sometimes (I’m sorry to say) the use of a buzzword actually adds very little: at best it is used to mean little more than ‘good’; at worst it adds nothing at all.
  4. Update your copy of the document with your longer form and explicit alternative to the buzzword (or leave it out altogether – as the case may be).
  5. Re-read the document and you should hopefully find it making a lot more sense.
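Step 1 of the exercise is trivially automatable.  Here is a minimal sketch of the find-and-replace in Python; the function name and the example sentence are made up for illustration.

```python
# A hypothetical sketch of step 1 of the buzzword exercise: replace every
# occurrence of the buzzword with _BUZZWORD_ so that each occurrence forces
# a conscious pause when re-reading the document.
import re

def mark_buzzword(text: str, buzzword: str) -> str:
    """Replace every case-insensitive occurrence of buzzword with _BUZZWORD_."""
    return re.sub(re.escape(buzzword), "_BUZZWORD_", text, flags=re.IGNORECASE)

doc = "Our SRE strategy hires SRE engineers to do SRE."
marked = mark_buzzword(doc, "SRE")
# Every 'SRE' in doc is now '_BUZZWORD_', ready for the re-reading in step 2.
```

Steps 2 to 5, of course, are the part that can’t be automated: deciding, occurrence by occurrence, what the author actually meant.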

In doing this, I find I’ve usually understood the document better and have also gained an insight into the semantic proliferation of the buzzword.

Finally I can recommend trying this on your own work.  It’s a good way to reflect on your own latest view of the meaning of the buzzword.  It can also improve what you’ve written (fewer buzzwords can lead to greater clarity).  Just don’t cut out every single one and forgo the benefits I mentioned at the start.