In early 2017 Martin Fowler wrote a great article called Continuous Integration Certification. The title was a joke, parodying the kind of professional certification that awards you the title ‘Certified Master of Buzzword’ in exchange for paying for a Buzzword certification course and passing a multiple-choice test. (Which reminds me, there is a whole site for one hilarious DevOps certification parody.)
Martin wasn’t, of course, actually selling a certification; he was just writing down self-evaluation criteria to help people assess whether they were performing Continuous Integration in the spirit in which the term was invented, or whether they were missing out by having effectively cargo-culted just the easiest-to-copy bits. The criteria for Martin’s self-evaluation had been created and used by Jez Humble at conferences, where he politely helped large rooms of people realise they still had further opportunity for kata around their CI/CD pipelines. It was an entertaining post that very effectively highlighted important parts of implementing Continuous Integration.
In this post I want to attempt the same thing but on the topic of Site Reliability Engineering (SRE). Unfortunately I won’t be offering certificates either (but feel free to make and print your own). Ben Treynor-Sloss at Google coined the term SRE, and Google have documented it at length, including in two books. I don’t claim to be an authority on the topic, but I do have around 18 months of experience experimenting with SRE practices in different settings, and I base my criteria here on what I have found to be most valuable.
- Do you have a common language to describe the reliability of your services, expressed through the eyes of your customer and written in business terms? You might choose to base this on the terminology that Google defined (SLIs, SLOs). It needs to describe customer interactions with the system and distinguish whether they are of sufficient quality to be considered reliable or not (see the first sketch after this list). It should be understood consistently by all internal stakeholders, across people performing the roles of the traditional functional areas (Business, Dev, Ops).
- Have you used that common language to create common definitions of what reliability means for your specific systems and have you agreed across all internal roles (see above) what acceptable target levels should be?
- Are there people identified as being committed to trying to achieve the agreed ‘acceptable’ target levels of reliability by supporting the system when it is experiencing issues? They may or may not have the job title Site Reliability Engineer.
- Are you capturing these metrics from your live systems so that you can evaluate, over different time windows, whether the target levels of reliability were achieved?
- Are there people who review these metrics as part of supporting the live service and use them to prioritise their efforts around resolving incidents that are causing customers to experience events that do not meet the definition of reliable? They could be using Error Budget alerting policies for this (see the burn-rate sketch after this list) or alternative approaches such as reporting on significant numbers of unreliable events.
- When one of the metrics is failing to hit the target value over the agreed measurement window, does this lead to consequences that address the root causes, improve reliability, and increase resilience? Are those consequences felt across (i.e. will be noticed by) all stakeholder functional areas (Business, Dev, Ops), i.e. they are definitely not left solely to the people performing support roles to resolve?
- Are the consequences in the above step pre-agreed, i.e. breaching the target doesn’t lead to a prioritisation exercise or, worse, something like a problem record being turned into a story at the bottom of a backlog? Instead these consequences should happen naturally.
- Have you made a commitment to the people supporting a service that manual work, incident resolution, receiving alerts when on call, etc. will only ever represent part, and not all, of their job, and that the rest of their time will be available for continuous improvement, personal development, and other overhead? You don’t have to call this work Toil (as Google do), and you don’t have to target keeping time spent on Toil below 50% (as Google do). You just need to set an expectation with some target and commit to hitting it.
- Do you have a well-understood definition of what Google call Toil, and are you planning to measure and manage the amount of it that teams are performing?
- Do you have a mechanism for quantifying the Toil performed, and do you use that information to prioritise reducing it (see the toil sketch after this list)?
- Do you have a mechanism that ensures team members have time to reduce Toil (i.e. pay back operational Technical Debt)?
- Do you continuously strive to create a psychologically safe environment and celebrate the learning you get from failures?
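To make the first few criteria concrete, here is a minimal sketch (in Python, with illustrative names and made-up numbers; this isn’t an implementation lifted from the Google books) of expressing an SLO in the common language above and evaluating an SLI and error budget against it over a measurement window:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    """An SLO: a target proportion of 'good' customer interactions
    over a rolling measurement window."""
    name: str
    target: float      # e.g. 0.999 means 99.9% of events must be good
    window_days: int   # the agreed measurement window, e.g. 28 days

def sli(good_events: int, total_events: int) -> float:
    """The SLI: good events as a proportion of all valid events."""
    return good_events / total_events if total_events else 1.0

def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget left in the window: 1.0 means untouched,
    0.0 means exactly spent, negative means the SLO target was breached."""
    allowed_bad = (1 - slo.target) * total  # budget expressed in bad events
    actual_bad = total - good
    return 1 - (actual_bad / allowed_bad) if allowed_bad else 0.0

# Example: checkout requests over the last 28 days (numbers are made up)
checkout = SLO(name="checkout-availability", target=0.999, window_days=28)
good, total = 998_700, 1_000_000
print(f"SLI: {sli(good, total):.4%}")  # SLI: 99.8700%
print(f"Budget left: {error_budget_remaining(checkout, good, total):.0%}")  # -30%
```

The point is less the arithmetic and more the shared vocabulary: Business, Dev, and Ops can all read ‘checkout-availability, 99.9% over 28 days’ and agree on what it means and whether it was met.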
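On Error Budget alerting specifically, the usual approach is to alert on burn rate: how fast you are consuming the budget relative to spending it evenly across the whole window. Another sketch with made-up numbers (the 14.4x fast-burn threshold for a 28-day window comes from Google’s SRE workbook, but your thresholds may well differ):

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Burn rate = observed error rate / allowed error rate.
    1.0 means spending the budget exactly as fast as the window allows."""
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo_target)

# Example: in the last hour, 300 of 20,000 checkout requests were bad.
rate = burn_rate(300, 20_000, slo_target=0.999)
if rate >= 14.4:  # fast-burn paging threshold for a 28-day window
    print(f"PAGE: burning error budget at {rate:.1f}x the sustainable rate")
```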
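And for quantifying Toil, you don’t need anything sophisticated to start: even a simple log of manual tasks, summed by category, tells you which source of Toil to automate away first. A hypothetical sketch:

```python
from collections import defaultdict

# Hypothetical toil log: (category, minutes spent) per manual task.
# In practice this could come from ticket labels or an on-call handover doc.
toil_log = [
    ("manual-deploy", 45),
    ("cert-renewal", 30),
    ("manual-deploy", 50),
    ("disk-cleanup", 20),
]

def toil_by_category(log):
    """Total minutes of toil per category, largest first, so the team
    can prioritise automating the most expensive sources."""
    totals = defaultdict(int)
    for category, minutes in log:
        totals[category] += minutes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for category, minutes in toil_by_category(toil_log):
    print(f"{category}: {minutes} min")
# manual-deploy tops the list, so automating deploys pays back the most.
```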
If you answered ‘yes’ to all of these, I believe you’ve successfully taken on board some of the most valuable parts of SRE. Print yourself a certificate! Obviously if you answered ‘yes’ to some and ‘we’re trying that’ to others, then that’s fantastic as well.
You might be surprised to see the omission of things like “you build it you run it”, “full stack engineers”, “operators coding”, even “cloud”. In my opinion these can be parts of the equation that work for some organisations, but they are orthogonal to whether or not you are practising SRE. If you are surprised not to see things about emergency response processes, data processing pipelines, canary releases, etc., it’s not because I don’t think they are important; I just don’t see them (or the emphasis on them) as unique enough to SRE to be part of my certification. (Perhaps I should create an advanced certification. Cha-ching.)
Hope this is helpful. Please let me know if you have ideas about the list.