SRE Certification – Free Self Test

In early 2017 Martin Fowler wrote a great article called Continuous Integration Certification.  The title referred to a joke he was making by parodying the professional certifications that award you the title ‘Certified Master of Buzzword’ in return for paying for a Buzzword certification course and passing a multiple choice test.  (Which reminds me, there is a whole site for one hilarious DevOps certification parody.)

Martin wasn’t of course actually selling a certification; he was just writing down self-evaluation criteria to help people assess whether they were actually performing Continuous Integration in the spirit in which the term was invented, or whether they were missing out by having effectively cargo-culted just the easiest-to-copy bits.  The criteria for Martin’s self-evaluation had been created and used by Jez Humble at conferences, where he politely helped large rooms of people realise they still had further opportunity for kata around their CI/CD pipelines.  It was an entertaining post that very effectively highlighted important parts of implementing Continuous Integration.

In this post I want to attempt to recreate the same blog but on the topic of Site Reliability Engineering (SRE).  Unfortunately I won’t be offering certificates either (but feel free to make and print your own).  Ben Treynor Sloss at Google created the term SRE, and Google have documented it at length, including in two books.  I don’t claim to be an authority on the topic.  But I do have around 18 months of experience of experimenting with SRE practices in different settings, and hence I base my criteria here on what I have found to be most valuable.

  1. Do you have a common language to describe the reliability of your services, expressed through the eyes of your customer and written in business terms?  You might choose to base this around the terminology that Google defined (SLIs, SLOs).  It needs to be able to describe customer interactions with the system and to distinguish whether they are of sufficient quality to be considered reliable or not.  It should be understood consistently by all internal stakeholders across people performing the roles of traditional functional areas (Business, Dev, Ops).
  2. Have you used that common language to create common definitions of what reliability means for your specific systems and have you agreed across all internal roles (see above) what acceptable target levels should be?
  3. Are there people identified as being committed to trying to achieve the agreed ‘acceptable’ target levels of reliability by supporting the system when it is experiencing issues?  They may or may not have the job title Site Reliability Engineer.
  4. Have you implemented capturing these metrics from your live systems so that you can evaluate over different time windows whether the target levels of reliability were achieved?  (See the sketch after this list.)
  5. Are there people who review these metrics as part of supporting the live service and use them to prioritise their efforts around resolving incidents that are causing customers to experience events that do not meet the definition of being reliable?  They could be using Error Budget alerting policies for this, or alternative approaches like reporting on significant numbers of unreliable events.
  6. When one of the metrics is failing to hit the target value over the agreed measurement window, does this lead to consequences that address the root causes, improve reliability, and increase resilience?  Are those consequences felt across (i.e. will be noticed by) all stakeholder functional areas (Business, Dev, Ops), i.e. they definitely aren’t just the job of the affected people performing support roles to resolve?
  7. Are the consequences in the above step pre-agreed, i.e. breaching the target doesn’t lead to a prioritisation exercise or, worse, something like a problem record being turned into a story at the bottom of a backlog?  Instead these consequences should happen naturally.
  8. Have you made a commitment to the people supporting a service that manual work, incident resolution, receiving alerts when on call, etc. will only represent part, and not all, of their job?  The rest of their time will be available for continuous improvement, personal development, and other overhead.  You don’t have to call this work Toil (as Google do), and you don’t have to target people spending less than 50% of their time on Toil (as Google do).  You just need to set an expectation with some target and commit to hitting it.
  9. Do you have a well-understood definition of what Google call Toil, and are you planning to measure and manage the amount of it that teams are performing?
  10. Do you have a mechanism for quantifying the Toil performed, and do you use that information to prioritise the reduction of Toil?
  11. Do you have a mechanism that ensures team members have time to reduce Toil (i.e. pay back operational Technical Debt)?
  12. Do you continuously strive to create a psychologically safe environment and celebrate the learning you get from failures?
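
To make items 4 and 5 concrete, here is a minimal sketch of evaluating an availability-style SLI against an SLO target over a measurement window and reporting error budget consumption.  Everything in it (the function name, the 99.9% target, the event counts) is an illustrative assumption rather than a prescription:

```python
# A minimal sketch: evaluate an SLI against an SLO over a window and
# report error budget consumption. Names and numbers are illustrative.

def error_budget_report(good_events: int, total_events: int,
                        slo_target: float = 0.999) -> dict:
    """Compare a measured SLI against an SLO target for one window."""
    sli = good_events / total_events                  # measured reliability
    allowed_bad = (1 - slo_target) * total_events     # the error budget, in events
    actual_bad = total_events - good_events
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "budget_consumed": actual_bad / allowed_bad,  # 1.0 = budget exhausted
    }

# Example: a 30-day window with 1,000,000 requests of which 1,200 failed.
# SLI = 0.9988, the 99.9% SLO is missed, and 120% of the budget is spent.
print(error_budget_report(good_events=998_800, total_events=1_000_000))
```

The useful property is the last number: once a window has consumed 100% of its error budget, whatever consequences you pre-agreed in item 7 should kick in.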

If you answered ‘yes’ to all of these, I believe you’ve successfully taken on board some of the most valuable parts of SRE.  Print yourself a certificate!  Obviously if you answered ‘yes’ to some and ‘we’re trying that’ to others, then that’s fantastic as well.

You might be surprised to see the omission of things like “you build it you run it”, “full stack engineers”, “operators coding”, even “cloud”.  In my opinion these can be parts of the equation that work for some organisations, but they are orthogonal to whether you are practicing SRE or not.  If you are surprised not to see things about emergency response processes, data processing pipelines, canary releases, etc., it’s not because I don’t think they are important; I just don’t see them (or the emphasis on them) as unique enough to SRE to be part of my certification.  (Perhaps I should create an advanced certification – cha-ching.)

Hope this is helpful.  Please let me know if you have ideas about the list.

Repeat after me “I am technical”

I’d say roughly once a week I hear work colleagues say the words “I am not technical”.  If you recognise this as something you say, I’m writing this blog to try to convince you to stop.

You are technical

The dictionary definition of technical is as follows:

technical | ˈtɛknɪk(ə)l | adjective: 1. relating to a particular subject, art, or craft, or its techniques.

Did you spot the bit about coding skills?  No, because it has nothing to do with being technical!

Think about any piece of technology and it’s possible to consider it at several levels of detail:

  1. Why is someone paying for it to exist?
  2. What does it do?
  3. What software is running and how does that work?
  4. What platform is the software running on and how does that work?
  5. What hardware is the software running on and how does that work?

You may be tempted to categorise some of these as functional as opposed to technical.  But as per Wikipedia, a functional requirement is still technical:

“Functional requirements may be calculations, technical details, data manipulation and processing and other specific functionality that define what a system is supposed to accomplish.”

Even if your primary focus is at the higher levels above, and you do feel compelled to draw a line on the list between functional and technical, you almost certainly know a LOT more than the average person in the world about the levels below wherever you place your line.  Your knowledge of those levels may be incomplete, but guess what – so is everyone’s.

We all have levels that we feel most comfortable working at.  I think it is vital to continuously learn about the other levels above and below.  Mentally labelling them as incompatible / off limits has no benefit.  No-one understands every detail all the way down or back up the stack.  At some point everyone has to base their understanding on an acceptance that the things that need to work just work.  Watch physicist Richard Feynman’s video about how (or, not actually really how) magnets work for more on this point.

Once you get a degree, you think you know everything.
Once you get a masters, you realise you know nothing.
Once you get a Ph.D., you realise no one knows anything!  (Anon)

Why it can be harmful to say “I’m not technical”

A huge reason not to brand yourself ‘not technical’ is stereotype threat.  This is a phenomenon studied in social psychology which shows that people from a group about which a negative stereotype exists may experience anxiety about the stereotype that actually hinders their performance and makes the stereotype more likely to come true.  Applied here: thinking that you are from a group (for example a role, a part of your company, an academic background, etc.) that is less technical may make it harder for you to become more technical.

Why it’s a waste to say “I’m not technical”

As an employee at your fine company, at any stage of your career, I think you are a role model to others.  You are a face that will be associated with all of the amazing technology and technical achievements we are responsible for delivering.  You are a face that will represent in people’s minds what people who work in ‘Tech’ look like.  This is an amazing opportunity and privilege and something to be proud of.  I don’t think you should diminish it by claiming your level of contribution is not technical.  That completely wastes the opportunity for you to demonstrate what ‘Tech’ really is and to make it appealing to others.

We all have a role to play in supporting each other in becoming more technical, and one of the simplest steps we can take is to be mindful of the words we use. Many people (for example Dave Snowden here) have observed the positive and negative impacts language has in terms of uniting an in-group and excluding an out-group.  We must play our part in minimising jargon and its unnecessary negative effects.  Just remember, when you ask someone what a particular piece of ‘technical’ terminology means, don’t expect them to necessarily have a complete understanding either.

Repeat after me:  “I am technical”.

Buzzwords hack – reading around the hype

Buzzwords (aka buzz terms) are useful.  They are shortcuts to more complex and emerging concepts.  They are tools for grouping aligned ideas and aspirations.  They can be valuable for connecting people with common ambitions to improve something (e.g. you and your customers / suppliers).

But buzzwords become a victim of their own success. The more popular a word becomes, the greater the diversity of definitions.  The result is that literature around a buzzword’s topic can quickly lose its intended meaning and impact.  We’ve all read countless blogs, articles, strategy documents, job descriptions etc., that suffer in this way.  The pitfall lies in the assumption by the author that the buzzword has a commonly and consistently understood meaning.

Here is a very simple suggestion for navigating buzzwords in documents.  I’ve used it quite a bit and found it surprisingly effective.

  1. Find and replace every instance of the buzzword (e.g. SRE!) with _BUZZWORD_.  (A small script can do this for you – see the sketch after this list.)
  2. Read the document and every time you get to the _BUZZWORD_ text, stop and decide what you think the author actually intended the word to mean.
  3. Start a list of definitions of what you think the author intended when adding it to the current sentence. It’s important to pay attention to the context in which it is used, to avoid just focusing on your own preconceptions about the meaning of the word.  Sometimes (I’m sorry to say) the use of a buzzword actually adds very little – at best it is used to mean something is ‘good’; at worst it adds nothing at all.
  4. Update your copy of the document with your longer form and explicit alternative to the buzzword (or leave it out altogether – as the case may be).
  5. Re-read the document and you should hopefully find it makes a lot more sense.
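
For step 1, a minimal sketch of the find-and-replace (assuming a plain-text document on disk; the filename and buzzword are illustrative):

```python
# A minimal sketch of step 1, assuming a plain-text document on disk.
# "strategy.txt" and the buzzword are illustrative assumptions.
import re

BUZZWORD = "SRE"

with open("strategy.txt", encoding="utf-8") as f:
    text = f.read()

# \b ensures only whole-word occurrences are replaced.
redacted = re.sub(rf"\b{re.escape(BUZZWORD)}\b", "_BUZZWORD_", text)

with open("strategy_redacted.txt", "w", encoding="utf-8") as f:
    f.write(redacted)
```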

In doing this, I find I’ve usually understood the document better and have also gained an insight into the semantic proliferation of the buzzword.

Finally, I can recommend trying this on your own work.  It’s a good way to reflect on your own latest view of the meaning of the buzzword.  It can also improve what you’ve written (fewer buzzwords can lead to greater clarity).  Just don’t cut out every single one and forgo the benefits I mentioned at the start.

What they actually did in the Phoenix Project and will it work for you?

I realised it had been over 6 years since I’d read The Phoenix Project and decided it deserved a re-read.  I knew I would enjoy it, and if nothing else it would help me understand what people are actually talking about when they reference character names I’d long forgotten.  Ok, I hadn’t forgotten Brent… who could?

Knowing how incredibly popular and influential the book turned out to be made me think of it in a slightly different light.  I tried to empathise with some of the different people I pictured reading the book over the years, especially those for whom it was their first introduction to DevOps.  Inspired by this, I decided to try something out: every time someone in the book did something new, I wrote it down.  It’s that list that I thought I would share in this blog.  Later I might opine on it; for now it’s up to you to think about what will work / is working for you (and please do share your experiences in the comments!)

Here they are in chronological order:

  1. Made change requests visible
  2. Made planned versus requested work visible
  3. Asked for more team members  (didn’t work)
  4. Documented timelines when understanding P1 incidents
  5. Performed practice incident calls
  6. Made the change process use physical cards and created the right attention filters and pre-work to enable fast reviews
  7. Categorised work – internal, project, change…
  8. Protected the bottleneck (person called Brent)
  9. Captured how to replicate the work done by the bottleneck
  10. Identified the constraint and exploited it, i.e. had it work on only the most important thing
  11. Ran an exercise with the managers that needed to collaborate better where they all shared personal stories that forced them to be vulnerable in front of each other
  12. Held a session to try to align to a common vision and be honest about as-is
  13. Looked for four types of work: business projects, IT Operations projects, changes, and unplanned work
  14. Froze all projects except Phoenix (the top priority one) – single-track focus. Then only released work that couldn’t distract the constraint – those pieces of work were identified by creating a bill of resources for a task
  15. Made work that improves daily work more important than doing daily work! (Kata)
  16. Worked out what to do with security projects described like this: “These ‘security’ projects decrease your project throughput, which is the constraint for the entire business, and swamp the most constrained resource in your organization. And they don’t do squat for scalability, availability, survivability, sustainability, security, supportability, or the defensibility of the organization.”
  17. Passed a compliance audit using business processes to backstop IT
  18. If it’s not on the kanban board, it won’t get done
  19. Took regularly occurring tasks and worked out how long they take.
  20. Used a FIFO queue to predict completion times.
  21. Communicated predicted completion times to end users
  22. Used mapping out / visualising work to figure out ways to improve it
  23. Created checklists, especially before handoffs, and ran 2-week PDCA cycles (kata)
  24. Colour-coded cards when aiming for a particular ratio of work types
  25. Categorised work: uses constraint, increases constraint throughput, other
  26. When it all started to go wrong (a task was underestimated and had handoffs as well), broke out Little’s Law (see the worked sketch after this list)
  27. Created top 20 recurring tasks to improve upon
  28. Had a minute-by-minute expediter of important work
  29. Looked at business objectives and (talking to the business owners) aligned what IT can prioritise to them, finding them in the value chains. Then looked at how IT could jeopardise those objectives – starting with the CFO’s
  30. Talked to all the business leads about how IT affects them.
  31. Mapped the business objectives to the IT systems and then designed ways to protect those IT systems
  32. Got the COO to accept the impact of IT on business metrics
  33. Did a business–IT alignment activity with the business and used it to do risk management
  34. Did a retrospective on why a recent big program was flawed
  35. Did a complete re-scoping of security efforts
  36. Moved to smaller batch (release) sizes
  37. Worked towards small batch sizes – aimed for 10 per day
  38. Did a value stream mapping and marked problem areas
  39. Created a dedicated end-to-end team with special dispensation; automated environment creation for Dev, QA, and Prod
  40. Did cloud adoption in 1 week(!)
  41. Created the ability to re-create prod at the touch of a button(!)
  42. In-sourced something now strategic to the business.

Hope this was helpful.  Shall I do it for The Unicorn Project?

Does your Class DevOps implement Interface DevOps?

I always worry a bit when I hear people repeat Google’s famous statement that “class SRE implements DevOps“.  It’s not that I think it’s wrong.  I certainly agree that the way Google describe SRE makes it sound very much aligned to the objectives of DevOps.  What I worry about is that (as I’ve witnessed) sometimes people misinterpret this to mean they are identical.  The danger of doing this is that people and organisations may close themselves off to the transformative potential of applying the valuable processes and approaches emphasised in SRE.
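
To stretch the programming metaphor for a moment: an interface declares outcomes, and a class that implements it is one concrete way of achieving them – rarely the only way.  A minimal sketch (the method names are illustrative, loosely paraphrasing how Google present the comparison):

```python
# A minimal sketch of the metaphor. The method names are illustrative,
# loosely paraphrasing how Google present the comparison.
from abc import ABC, abstractmethod

class DevOps(ABC):
    """The 'interface': outcomes to achieve, no mechanism prescribed."""
    @abstractmethod
    def reduce_organisational_silos(self) -> str: ...
    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...
    @abstractmethod
    def implement_gradual_change(self) -> str: ...

class SRE(DevOps):
    """One concrete implementation - not the only possible one."""
    def reduce_organisational_silos(self) -> str:
        return "SLOs agreed across Business, Dev and Ops"
    def accept_failure_as_normal(self) -> str:
        return "error budgets and blameless postmortems"
    def implement_gradual_change(self) -> str:
        return "small releases governed by an error budget policy"

print(SRE().accept_failure_as_normal())
```

The point being that implements is not equals: the same interface admits other implementations.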

I’ve seen a lot of “DevOps” initiatives very busy with the Dev side of DevOps, i.e. CI/CD pipelines and environment automation, but often with much less attention on Operational concerns, i.e. Operability, Reliability, Resilience, Service Management, Observability.  I’ve seen SRE implementations prove very effective at specifically addressing these residual problems.

I think it is true that the SRE body of knowledge contains a lot of valuable advice that is very relevant to implementing DevOps.  I agree that fundamentally they are trying to achieve exactly the same outcomes, i.e. sustainably great IT that can enable great businesses.  I agree that for organisations heavily invested in implementing DevOps one way or another, it would be very unhelpful to ignore that, duplicate that, or in any way deprecate or disparage that.  BUT there are also certain potential advantages to treating SRE differently within the context of a specific company/enterprise, such as:

  • It may create renewed focus on the Ops side of DevOps.
  • If starting in traditional Operations teams (by measuring SLIs, Toil, etc.) it may be possible to deliver improvements (at least in terms of situational awareness) without having to impact in-flight DevOps work (which may be CI/CD focused).
  • SRE practices are agnostic of tools, technology, Continuous Delivery maturity, etc., so this may be an opportunity to affect those systems that have so far not been the focus of your DevOps efforts.  For example, a lot of DevOps attention has been on custom Digital code, whilst harder-to-reach systems (COTS products, Mainframe, core mission-critical platforms, etc.) may still be waiting in line.
  • If the DevOps initiatives in flight are currently heavily focused on Tools-as-a-Service, SRE may present an opportunity to take a fresh and complementary approach.

So overall yes, understood correctly, Google’s famous statement is true.  But if your DevOps instantiation does happen to be a little on the side of big DEV, small ops, and you aren’t quite yet continuously deploying unicorns… I recommend putting some fresh, and possibly even separate, energy – focused on operations, reliability, and technical debt – into trying out more ideas from SRE.


Top 5 Misconceptions about SRE

I am very excited about Site Reliability Engineering (SRE) and continue to see more and more positive outcomes from people trying it out.  Like any popular term (especially in IT), the natural diversification of what SRE means is already well underway.  In this blog I wanted to share my favourite things that I’ve heard said about SRE but consider to be wrong or unhelpful.

  1. “SRE is only for use in Cloud.”

No, I would argue that adoption can be started by anyone irrespective of their technology stack.  At some point cloud makes infrastructure easier to automate, but that may not even be an urgent priority as part of an SRE implementation.  If we consider the Theory of Constraints I think in many cases infrastructure isn’t the true bottleneck for improving IT services.

  2. “SRE is an extreme / more advanced version of DevOps for the super mature.”

No, adoption can be started by anyone independently or combined with DevOps.  Just make sure you are working on SRE with / within the Operations team / function.

  3. “SRE is a special kind of platform development team.”

No, for me this is mixing up concepts.  As I’ve said before, I believe strongly in treating the platform like a first-class product.  Having done that, you are still presented with the same choices as for any product – who develops it, who operates it, and how?  I see SRE as just as applicable to helping with the Dev vs Ops tension on platform applications as on any other software component.

  4. “SRE is just for websites.”

No, and it may be helpful to make the S stand for System or Service instead of Site.  (Can’t see it catching on though.)

  5. “SRE is only for Google / Tech Giants / Unicorns / Start-ups / Thrill-seekers.”

No, SRE adoption can be safe and gradual, and in some ways it is easier for traditional enterprises than DevOps is.  Unlike DevOps, SRE provides alternative options to putting Developers in charge of production.  Plus you probably have all of the tools you need to get started already.

Hopefully this was helpful.  In my next blog I’m going to share a few reservations I have about the popular statement “class SRE implements DevOps”.

Minimum Viable Operations

Several years ago I started using the name Minimum Viable Operations to describe a concept I’d been tinkering with in relation to: Agile, DevOps, flow, service management and operational risk.  I define it as:

The minimum amount of processes, controls and security required to expose a new system or service to a target consumer whilst achieving the desired level of availability and risk.

These days I see Site Reliability Engineering (SRE) rapidly emerging as a set of processes that help achieve this.  Whilst chatting with a friend about it earlier today I realised I’d never published my blog on this, so here it is.

When you need Minimum Viable Operations

These days lots of organisations have successfully sped up the process of rapidly innovating and developing new applications using modern lightweight architectures and cloud.  They are probably using an agile development methodology and hopefully some good Continuous Delivery, perhaps even using something like the Accenture DevOps Platform (ADOP).  They may have achieved this by innovating on the edge, i.e. creating an autonomous sub-organisation free from the trappings and the controls of the mainstream IT department.  I’ve worked with quite a few organisations who have done this in separate, possibly more “trendy” offices known as Digital Hubs, Innovation Labs, Studios, etc.

But when the sprints are over, the show and tell is complete and the post-it notes start to drop off the wall… was it all worth it?

If hundreds or maybe even thousands of person-hours rich with design thinking, new technologies, and automation culminate in an internal demo that doesn’t reach real end users, you haven’t become agile.  If you’ve accidentally diluted the term Minimum Viable Product (MVP) to mean Release 1 and that took 6 months to complete, you haven’t become agile.

Whilst prototypes do serve a purpose,

the highest value to an entrepreneurial organisation comes from proving or disproving whether target users are willing to use and keep using a product or service, and whether there are enough users to make it commercially viable.

This is what Eric Ries, in The Lean Startup, calls the Value Hypothesis and the Growth Hypothesis.  It requires putting something “live” in front of real users ASAP, and to do that successfully I think you need Minimum Viable Operations (oh why not, let’s give it a TLA: MVO).

Alternative Options to MVO

Modern standalone agile development organisations have four options:

  1. Don’t bother with any operations and just keep building prototypes and dazzling demos.  The downside is that at some point someone may realise you don’t make money and shut you down.
  2. Don’t bother with MVO but try to get your prototype turned into live systems via a project in the traditional IT department.  The downside is that time to market will be huge and the rate of change of your idea will hit a wall.  CR anyone?
  3. Don’t bother with MVO and just put prototypes live.  The downside here is potentially unbounded risk of reputation and damages through outages and security disasters.
  4. Implement MVO, get things in front of customers, get into a nice tight lean OODA loop / PDCA cycle (perhaps even using the Cynefin framework), and overall create a learning environment capable of exponential growth.  The downside here is that it might feel a lot less safe than prototyping and you will have to learn to celebrate failure.

Obviously option 4 is my favourite.

Implementing MVO

Aside: for simplicity’s sake, I’m conveniently ignoring the vital topic of organisational governance and politics.  Clearly navigating this is of vital importance, but beyond what I want to get into here.

Implementing Minimum Viable Operations is essentially a risk management exercise with a lot in common with standard Operations Architecture (and nowadays SRE).  Everything just needs to be done much, much faster.  This means thinking very hard, upfront, about the ilities such as reliability, availability, scalability, and security.  But firstly it also requires you to be much more realistic and to think harder:

  • Reliability – are three 9s really needed in order to get your Minimum Viable Product in front of customers and measure their reactions?  (See the sketch after this list for what each extra 9 actually buys.)  Possibly you don’t actually need to create a perfect user experience for everyone all the time, and you can evaluate your hypotheses based on the usage that you are able to achieve.  Obviously putting out products of a level of service quality unbefitting of your brand presents something of a reputational risk.  But do you actually need to apply your brand to the product?  Or perhaps you can incentivise users to put up with unreliability via other rewards, for example by not charging.
  • Availability – we live in a 24×7, always-on world, but at the start do we need our MVP to be always online?  Perhaps if we strictly limit use by geography, it becomes painless to have downtime windows.  Perhaps, as an extension of the reliability argument above, taking the system down for maintenance, re-design, or even to discontinue it, is acceptable.
  • Scalability – perhaps if our service gets so popular that we cannot keep it live, that isn’t such a bad thing; in fact it could be exactly the problem that we want to have.  It’s like the old “premature optimisation is the root of all evil” argument.  There are lots of ways to scale things when needed in the cloud.  Perhaps perfecting this isn’t a pre-requisite to go live.
  • Security – obviously this is a vital consideration.  But again, it must be strictly “right-sized”.  If a system doesn’t contain personal data or take payment information, you aren’t storing data, and you aren’t overly concerned about availability, perhaps security doesn’t have to be quite the barrier to go-live that you thought it might be.  Plus, if you are practicing good DevSecOps in your development process, and using Nexus Lifecycle, you are off to a good start.
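
On the reliability point, here is a minimal sketch of what each extra 9 actually buys you in allowed downtime (the 30-day window is an illustrative choice):

```python
# A minimal sketch: allowed downtime per 30-day window for different
# availability targets. The window length is an illustrative choice.

WINDOW_MINUTES = 30 * 24 * 60  # one 30-day month

for target in (0.99, 0.999, 0.9999):
    downtime = (1 - target) * WINDOW_MINUTES
    print(f"{target:.2%} allows {downtime:.0f} minutes of downtime")

# 99.00% allows 432 minutes (~7.2 hours)
# 99.90% allows 43 minutes
# 99.99% allows 4 minutes
```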

Secondly, unlike a traditional Operations Architecture scenario where SLAs, RPOs, etc. may have been rather arbitrarily defined (perhaps copied and pasted into a contract), when creating MVO for an MVP you may actually have the opportunity to change the MVP design to make it easier to put live.  Does the MVP actually need to be branded with the company name, take payment, require registration, etc.?  If none of that impacts your hypotheses, consider dropping them and going live sooner.

Finally, as a long-time fan of PaaS, I should also mention that organisations can make releasing MVPs much easier if they have invested in a re-usable hosting platform (which of course may itself benefit from MVO).

What can someone do with this?

I’ve seen value in organisations creating a relatively light methodology for doing something similar to MVO.  Nowadays they might (perfectly reasonably) also call it implementing SRE.  It could include:

  1. Guidelines for making MVPs smaller and easier to put live and operate.
  2. Risk assessment tools to help right-size Operations (see the sketch after this list).
  3. Risk profiles for different types of application.
  4. Peer review checklists to validate someone else’s minimum viable operations architecture.
  5. Reusable platform (PaaS) services with opinions that natively provide certain safety as well as freedom.
  6. Re-usable security services.
  7. Re-usable monitoring solutions.
  8. Training around embracing learning from failure and creating good customer experiments.
  9. A general increase in focus on and awareness of how quickly we are actually putting things live for clients.
  10. When and how to transition from MVO to a full blown SRE controlled service.
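
To give a flavour of item 2, here is a minimal sketch of a deliberately crude risk-assessment questionnaire that right-sizes the operations needed before go-live.  The questions, weights, and thresholds are all illustrative assumptions, not a standard:

```python
# A minimal sketch of a crude risk-assessment questionnaire that
# right-sizes operations before go-live. Questions, weights, and
# thresholds are illustrative assumptions, not a standard.

QUESTIONS = {
    "stores_personal_data": 5,
    "takes_payments": 5,
    "carries_company_brand": 3,
    "needs_24x7_availability": 2,
}

def mvo_profile(answers: dict) -> str:
    """Sum the weights of every 'yes' answer and map to an ops profile."""
    score = sum(weight for question, weight in QUESTIONS.items()
                if answers.get(question))
    if score >= 8:
        return "full operations architecture review before go-live"
    if score >= 4:
        return "lightweight peer-reviewed MVO checklist"
    return "minimal controls - ship, measure, learn"

# A branded MVP that needs to stay up scores 5: the middle profile.
print(mvo_profile({"carries_company_brand": True,
                   "needs_24x7_availability": True}))
```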

Final Comment

DevOps has, alas, in many cases become conflated with implementing Continuous Delivery into testing environments.  Sometimes the last mile into Production is still too long.  Call it MVO (probably don’t!), call it SRE (why not?), still call it DevOps, Agile, or Lean (why not?)… I recommend giving some thought to MVO.