SRE Certification – Free Self Test

In early 2017 Martin Fowler wrote a great article called Continuous Integration Certification.  The title referred to a joke he was making by parodying some professional certifications available that give you the title ‘Certified Master of Buzzword’ by paying for a Buzzword certification course and passing a multiple choice test.  (Which reminds me there is a whole site for one hilarious DevOps certification parody).

Martin wasn’t of course actually selling a certification, he was just writing down self-evaluation criteria to help people assess whether they were actually performing Continuous Integration in the spirit in which the term was invented, or whether they were missing out by having effectively cargo-culted just the easiest-to-copy bits.  The criteria for Martin’s self-evaluation had been created and used by Jez Humble at conferences, where he politely helped large rooms of people realise they still had further opportunity for kata around their CI/CD pipelines.  It was an entertaining post that very effectively highlighted important parts of implementing Continuous Integration.

In this post I want to attempt the same kind of blog but on the topic of Site Reliability Engineering (SRE).  Unfortunately I won’t be offering certificates either (but feel free to make and print your own).  Ben Treynor-Sloss at Google created the term SRE and Google have documented it at length, including in two books.  I don’t claim to be an authority on the topic.  But I do have around 18 months of experience of experimenting with SRE practices in different settings, and hence I base my criteria here on what I have found to be most valuable.

  1. Do you have a common language to describe the reliability of your services, expressed through the eyes of your customer and written in business terms?  You might choose to base this around the terminology that Google defined (SLIs, SLOs).  It needs to be able to describe customer interactions with the system and to distinguish whether they are of sufficient quality to be considered reliable or not.  It should be understood consistently by all internal stakeholders across people performing the roles of traditional functional areas (Business, Dev, Ops).
  2. Have you used that common language to create common definitions of what reliability means for your specific systems and have you agreed across all internal roles (see above) what acceptable target levels should be?
  3. Are there people identified as being committed to trying to achieve the agreed ‘acceptable’ target levels of reliability by supporting the system when it is experiencing issues?  They may or may not have the job title Site Reliability Engineer.
  4. Have you implemented capturing these metrics from your live systems so that you can evaluate over different time windows whether the target levels of reliability were achieved?
  5. Are there people who review these metrics as part of supporting the live service and use them to prioritise their efforts around resolving incidents that are causing customers to experience events that do not meet the definition of being reliable?  They could be using Error Budget alerting policies for this or alternative approaches like reporting on significant numbers of unreliable events.
  6. When one of the metrics is failing to hit the target value over the agreed measurement window, does this situation lead to consequences that address the root causes, improve reliability and increase resilience?  Are those consequences felt across (i.e. will be noticed by) all stakeholder functional areas (Business, Dev, Ops), i.e. they definitely aren’t just the job of the affected people performing support roles to resolve?
  7. Are the consequences in the above step pre-agreed, i.e. the breaching of the target doesn’t lead to a prioritisation exercise or, worse, something like a problem record being turned into a story at the bottom of a backlog?  Instead these consequences should happen naturally.
  8. Have you made a commitment to the people supporting a service that manual work, incident resolution, receiving alerts when on call, etc. will only represent part and not all of their job?  The rest of their time will be available for continuous improvement, personal improvement, and other overhead.  You don’t have to call this work Toil (as Google do), and you don’t have to set a target of people spending below 50% of their time on Toil (as Google do).  You just need to set an expectation with some target and commit to hitting it.
  9. Do you have a well understood definition of what Google call Toil, and are you planning to measure and manage the amount of it that teams are performing?
  10. Do you have a mechanism for quantifying Toil performed and using that information to prioritise the reduction of Toil?
  11. Do you have a mechanism that ensures team members have time to reduce Toil (i.e. pay back operational Technical Debt)?
  12. Do you continuously strive to create a psychologically safe environment and celebrate the learning you get from failures?
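
To make criteria 4, 5 and 10 a little more concrete, here is a rough sketch in Python of what evaluating a measurement window and quantifying Toil might look like.  This is just my illustration, not how Google or any particular monitoring tool does it.  The names (evaluate_window, toil_fraction, WindowReport) and the example numbers are all made up, and it assumes you can already count "good" and "total" customer events for a window.

```python
from dataclasses import dataclass


@dataclass
class WindowReport:
    sli: float                     # fraction of events judged reliable
    slo_met: bool                  # did the window hit the agreed target?
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = used up, < 0 = overspent


def evaluate_window(good_events: int, total_events: int, slo_target: float) -> WindowReport:
    """Compare one measurement window against an agreed SLO target.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        # No customer interactions, so nothing was unreliable.
        return WindowReport(sli=1.0, slo_met=True, error_budget_remaining=1.0)

    sli = good_events / total_events
    allowed_bad = (1.0 - slo_target) * total_events  # the error budget, in events
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad
    return WindowReport(sli=sli, slo_met=sli >= slo_target, error_budget_remaining=remaining)


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of a team's time spent on Toil, to compare with whatever target you agreed."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


if __name__ == "__main__":
    # Hypothetical numbers: 100,000 requests in the window, 50 of them unreliable.
    report = evaluate_window(good_events=99_950, total_events=100_000, slo_target=0.999)
    print(report)
    # Hypothetical numbers: 18 of 40 hours spent on manual, repetitive work.
    print(toil_fraction(toil_hours=18, total_hours=40))
```

The point of a sketch like this is that "was the target met?" and "how much Toil are we doing?" become numbers everyone (Business, Dev, Ops) can read the same way, which is what makes pre-agreed consequences possible.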

If you answered ‘yes’ to all of these I believe you’ve successfully taken on board some of the most valuable parts of SRE.  Print yourself a certificate!  Obviously if you answered ‘yes’ to some and ‘we’re trying that’ to others, then that’s fantastic as well.

You might be surprised to see the omission of things like “you build it you run it”, “full stack engineers”, “operators coding”, “cloud” even.  In my opinion these can be parts of the equation that work for some organisations, but are orthogonal to whether or not you are practising SRE.  If you are surprised not to see things about emergency response processes, data processing pipelines, canary releases etc., it’s not because I don’t think they are important, I just don’t see them (or the emphasis on them) as unique enough to SRE to be part of my certification.  (Perhaps I should create an advanced certification – cha-ching.)

Hope this is helpful.  Please let me know if you have ideas about the list.

Repeat after me “I am technical”

I’d say roughly once a week I hear work colleagues say the words “I am not technical”.  If you recognise this as something you say, I’m writing this blog to try to convince you to stop.

You are technical

The dictionary definition of technical is as follows:

technical /ˈtɛknɪk(ə)l/ adjective 1. relating to a particular subject, art, or craft, or its techniques.

Did you spot the bit about coding skills?  No, because it has nothing to do with being technical!

Think about any piece of technology and it’s possible to consider levels of detail:

  1. Why is someone paying for it to exist?
  2. What does it do?
  3. What software is running and how does that work?
  4. What platform is the software running on and how does that work?
  5. What hardware is the software running on and how does that work?

You may be tempted to categorise some of these as functional as opposed to technical.  But as per Wikipedia, a functional requirement is still technical:

“Functional requirements may be calculations, technical details, data manipulation and processing and other specific functionality that define what a system is supposed to accomplish.”

Even if your primary focus is in the higher levels above, and you do feel compelled to draw a line on the list between functional and technical, you almost certainly know a LOT more than the average person in the World about the levels below wherever you place your line.  Your knowledge of this level may be incomplete, but guess what – so is everyone’s.

We all have levels that we feel most comfortable working at.  I think it is vital to continuously learn about the other levels above and below.  Mentally labelling them as incompatible / off limits has no benefit.  No-one understands every detail all the way down or back up the stack.  At some point everyone has to base their understanding on an acceptance that the things that need to work just work.  Watch physicist Richard Feynman’s video about how (or not actually really how) magnets work for more on this point.

Once you get a degree you think you know everything.
Once you get a masters, you realise you know nothing.
Once you get a Ph.D., you realise no one knows anything!  (Anon)

Why it can be harmful to say “I’m not technical”

A huge reason not to brand yourself ‘not technical’ is stereotype threat.  This is a phenomenon studied in social psychology: people from a group about which a negative stereotype exists may experience anxiety about the stereotype that actually hinders their performance and makes it more likely that the stereotype comes true.  So applied here, thinking that you are from a group (for example role, part of your company, academic background, etc.) that is less technical may make you find it harder to get more technical.

Why it’s a waste to say “I’m not technical”

As an employee at your fine company, at any stage of your career, I think you are a role model to others.  You are a face that will be associated with all of the amazing technology and technical achievements we are responsible for delivering.  You are a face that will represent in people’s minds what people who work in ‘Tech’ look like.  This is an amazing opportunity and privilege and something to be proud of.  I don’t think you should diminish it by claiming your level of contribution is not technical.  It completely wastes the opportunity for you to demonstrate what ‘Tech’ really is and make it appealing to others.

We all have a role to play in supporting each other in becoming more technical, and one of the simplest steps we can take is to be mindful of the words we use.  Many people (for example Dave Snowden here) have observed the positive and negative impacts of specialist language in terms of uniting an in-group and excluding an out-group.  We must play our part in minimising jargon and its unnecessary negative effects.  Just remember, when you ask someone what a particular piece of ‘technical’ terminology means, don’t expect them to necessarily have a complete understanding either.

Repeat after me:  “I am technical”.