Top 5 Misconceptions about SRE

I am very excited about Site Reliability Engineering (SRE) and continue to see more and more positive outcomes from people trying it out.  Like any popular term (especially in IT), the natural diversification of what SRE means is already well underway.  In this blog I wanted to share my favourite things I’ve heard said about SRE that I consider to be wrong or unhelpful.

  1. “SRE is only for use in Cloud.”

No, I would argue that adoption can be started by anyone, irrespective of their technology stack.  At some point cloud makes infrastructure easier to automate, but that may not even be an urgent priority as part of an SRE implementation.  If we consider the Theory of Constraints, I think in many cases infrastructure isn’t the true bottleneck for improving IT services.

  2. “SRE is an extreme / more advanced version of DevOps for the super mature.”

No, adoption can be started by anyone independently or combined with DevOps.  Just make sure you are working on SRE with / within the Operations team / function.

  3. “SRE is a special kind of platform development team.”

No, for me this is mixing up concepts.  As I’ve said before, I believe strongly in treating the platform like a first-class product.  Having done that, you are still presented with the same choice as any product – who develops it, who operates it, and how?  I see SRE as just as applicable to the Dev vs Ops tension on a platform application as on any other software component.

  4. “SRE is just for websites.”

No, and it may be helpful to make the S stand for System or Service instead of Site.  (Can’t see it catching on though.)

  5. “SRE is only for Google / Tech Giants / Unicorns / Start-ups / Thrill seekers.”

No, SRE adoption can be safe, gradual, and in some ways is easier for traditional enterprises than DevOps is.  Unlike DevOps, SRE provides alternative options to putting Developers in charge of production.  Plus, you probably already have all of the tools you need to get started.

Hopefully this was helpful.  In my next blog I’m going to share a few reservations I have about the popular statement “class SRE implements DevOps”.


Minimum Viable Operations

Several years ago I started using the name Minimum Viable Operations to describe a concept I’d been tinkering with in relation to: Agile, DevOps, flow, service management and operational risk.  I define it as:

The minimum amount of processes, controls and security required to expose a new system or service to a target consumer whilst achieving the desired level of availability and risk.

These days I see Site Reliability Engineering (SRE) rapidly emerging as a set of processes that help achieve this.  Whilst chatting with a friend about it earlier today I realised I’d never published my blog on this, so here it is.

When you need Minimum Viable Operations

These days lots of organisations have successfully sped up the process of innovating and developing new applications using modern lightweight architectures and cloud.  They are probably using an agile development methodology and hopefully some good Continuous Delivery, perhaps even using something like the Accenture DevOps Platform (ADOP).  They may have achieved this by innovating on the edge, i.e. creating an autonomous sub-organisation free from the trappings and controls of the mainstream IT department.  I’ve worked with quite a few organisations who have done this in separate, possibly more “trendy”, offices known as Digital Hubs, Innovation Labs, Studios, etc.

But when the sprints are over, the show and tell is complete and the post-it notes start to drop off the wall… was it all worth it?

If hundreds or maybe even thousands of person hours rich with design thinking, new technologies, and automation culminate in an internal demo that doesn’t reach real end users, you haven’t become agile.  If you’ve accidentally diluted the term Minimum Viable Product (MVP) to mean Release 1, and that took 6 months to complete, you haven’t become agile.

Whilst prototypes do serve a purpose,

the highest value to an entrepreneurial organisation comes from proving or disproving whether target users are willing to use and keep using a product or service, and whether there are enough users to make it commercially viable.

This is what Eric Ries, in The Lean Startup, calls the Value Hypothesis and the Growth Hypothesis.  This requires putting something “live” in front of real users ASAP, and to do that successfully I think you need Minimum Viable Operations (oh why not, let’s give it a TLA: MVO).

Alternative Options to MVO

Modern standalone agile development organisations have four options:

  1. Don’t bother with any operations and just keep building prototypes and dazzling demos.  The downside is that at some point someone may realise you don’t make money and shut you down.
  2. Don’t bother with MVO but try to get your prototype turned into live systems via a project in the traditional IT department.  The downside is that time to market will be huge and the rate of change of your idea will hit a wall.  CR anyone?
  3. Don’t bother with MVO and just put prototypes live.  The downside here is potentially unbounded risk of reputation and damages through outages and security disasters.
  4. Implement MVO, get things in front of customers, get into a nice tight OODA loop / PDCA cycle (perhaps even using the Cynefin framework), and overall create a learning environment capable of exponential growth.  The downside here is that it might feel a lot less safe than prototyping and you will have to learn to celebrate failure.

Obviously option 4 is my favourite.

Implementing MVO

Aside: for simplicity’s sake, I’m conveniently ignoring the vital topic of organisational governance and politics.  Clearly navigating this is important, but it’s beyond what I want to get into here.

Implementing Minimum Viable Operations is essentially a risk management exercise with a lot in common with standard Operations Architecture (and nowadays SRE).  Everything just needs to be done much, much faster.  This means thinking hard, and upfront, about the “ilities” such as reliability, availability, scalability and security.  Firstly, it requires you to be much more realistic and think harder:

  • Reliability – is 3 9’s really needed in order to get your Minimum Viable Product in front of customers and measure their reactions?  Possibly you don’t actually need to create a perfect user experience for everyone all the time, and can evaluate your hypotheses based on the usage that you are able to achieve.  Obviously putting out products at a level of service quality unbefitting of your brand presents something of a reputational risk.  But do you actually need to apply your brand to the product?  Or perhaps you can incentivise users to put up with unreliability via other rewards, for example by not charging.
  • Availability – we live in a 24/7, always-on world, but at the start, do we need our MVP to be always online?  Perhaps if we strictly limit use by geography it becomes painless to have downtime windows (see the downtime-budget sketch after this list).  Perhaps, as an extension to the reliability argument above, taking the system down for maintenance, re-design, or even to discontinue it, is acceptable.
  • Scalability – perhaps if our service gets so popular that we cannot keep it live, that isn’t such a bad thing; in fact it could be exactly the problem that we want to have.  It’s like the old “premature optimisation is the root of all evil” argument.  There are lots of ways to scale things when needed in the cloud.  Perhaps perfecting this isn’t a prerequisite to going live.
  • Security – obviously this is a vital consideration.  But again this must be strictly “right sized”.  If a system doesn’t contain personal data or take payment information, you aren’t storing data, and you aren’t overly concerned about availability, perhaps this doesn’t have to be quite the barrier to going live that you thought it might be.  Plus if you are practising good DevSecOps in your development process, and using Nexus Lifecycle, you are off to a good start.
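
To make the availability trade-off concrete, here is a minimal sketch (in Python; the targets and the 30-day window are illustrative numbers I’ve picked, not recommendations) that converts an availability target into a downtime budget:

```python
# Sketch: how much downtime does an availability target actually allow?
# The targets and window below are illustrative, not recommendations.

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

def allowed_downtime_minutes(availability: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Return the downtime budget (in minutes) for a given availability target."""
    return (1.0 - availability) * window_minutes

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} availability -> "
          f"{allowed_downtime_minutes(target):.0f} minutes of downtime per 30 days")
```

Two nines buys you roughly seven hours of downtime a month; for an MVP serving one geography during business hours, that may be perfectly acceptable.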

Secondly, unlike a traditional Operations Architecture scenario where SLAs, RPOs, etc. may have been rather arbitrarily defined (perhaps copied and pasted into a contract), when creating MVO for an MVP you may actually have the opportunity to change the MVP design to make it easier to put live.  Does the MVP actually need to: be branded with the company name, take payment, require registration, etc.?  If none of that impacts your hypotheses, consider dropping them and going live sooner.

Finally, as a long-time fan of PaaS, I should also mention that organisations can make releasing MVPs much easier if they have invested in a re-usable hosting platform (which of course may benefit from MVO).

What can someone do with this?

I’ve seen value in organisations creating a relatively light methodology for doing something similar to MVO.  Nowadays they might (perfectly reasonably) also call it implementing SRE.  It could include:

  1. Guidelines for making MVPs smaller and easier to put live and operate.
  2. Risk assessment tools to help right-size Operations (see the sketch after this list).
  3. Risk profiles for different types of application.
  4. Peer review checklists to validate someone else’s minimum viable operations architecture.
  5. Reusable, opinionated platform (PaaS) services that natively provide a degree of safety as well as freedom.
  6. Re-usable security services.
  7. Re-usable monitoring solutions.
  8. Training around embracing learning from failure and creating good customer experiments.
  9. A general increase in focus on and awareness of how quickly we are actually putting things live for clients.
  10. When and how to transition from MVO to a full blown SRE controlled service.
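
To illustrate item 2, here is a deliberately simple, entirely hypothetical risk-scoring sketch; the attributes, weights and tier thresholds are assumptions I’ve invented for illustration, not a published model:

```python
# Hypothetical risk-assessment sketch for right-sizing the operations of an MVP.
# Attributes, weights and thresholds are illustrative assumptions only.

RISK_WEIGHTS = {
    "stores_personal_data": 3,
    "takes_payments": 3,
    "externally_branded": 2,
    "internet_facing": 1,
}

def ops_tier(attributes: dict) -> str:
    """Map application attributes to a suggested operations tier."""
    score = sum(weight for name, weight in RISK_WEIGHTS.items() if attributes.get(name))
    if score >= 5:
        return "full SRE-controlled service"
    if score >= 2:
        return "minimum viable operations plus targeted security review"
    return "minimum viable operations"

print(ops_tier({"internet_facing": True}))                               # minimum viable operations
print(ops_tier({"stores_personal_data": True, "takes_payments": True}))  # full SRE-controlled service
```

A real version would obviously need to reflect your own regulatory and brand constraints, but even something this crude can make “right-sizing” conversations faster.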

Final Comment

DevOps has, alas, in many cases become conflated with implementing Continuous Delivery into testing environments.  Sometimes the last mile into Production is still too long.  Call it MVO (probably don’t!), call it SRE (why not?), still call it DevOps, Agile, or Lean (why not?)… I recommend giving some thought to MVO.

SLIs and SLOs versus NFRs

In this blog I propose a working definition of the difference between Non Functional Requirements (NFRs), and Service Level Indicators and Objectives (SLIs and SLOs) from SRE.  I couldn’t find anything else written about this topic, but I’ve found this definition useful so far in my own experience.

If you want to first brush up on SLIs and SLOs (both SRE terminology) I suggest reading this.  Otherwise, let’s dive in.

“SLOs are a subset of NFRs.”

All SLOs should be recorded as NFRs, but only NFRs that can be continuously measured in production make suitable SLOs.

All Non-Functional Requirements (NFRs) must be testable, but the frequency of testing is variable.  Some require validation just once, some can be periodically re-validated, some need validating before relevant changes, and some can be validated continuously in non-production and in production.

When NFRs can be evaluated continuously and it is safe to do so in production they can be adopted as Service Level Indicators (SLIs).  As SLIs they can be measured by Operations (or the SRE team) in Production, managed with Error Budgets, and be subjected to pre-defined Consequences for when Error Budgets are depleted and SLOs breached.
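
As a concrete illustration, here is a minimal sketch (assuming a simple request-success SLI and a 30-day window; the request counts are made up) of how an SLO becomes something Operations can track with an Error Budget and pre-defined Consequences:

```python
# Sketch: tracking an availability SLO with an Error Budget.
# The SLI here is the fraction of successful requests; all numbers are illustrative.

SLO_TARGET = 0.999          # 99.9% of requests should succeed over a 30-day window
total_requests = 1_200_000  # observed so far in this window
failed_requests = 950       # observed so far in this window

error_budget = (1.0 - SLO_TARGET) * total_requests   # failures we can tolerate: 1,200
budget_remaining = error_budget - failed_requests    # 250 failures left

sli = 1.0 - failed_requests / total_requests
print(f"SLI: {sli:.4%}, error budget remaining: {budget_remaining:.0f} failed requests")

if budget_remaining <= 0:
    # The pre-defined Consequence, e.g. freeze feature releases until the SLO recovers.
    print("Error budget exhausted: apply the agreed consequences")
```

The point is simply that once the NFR is continuously measurable, the budget and its consequences can be operated on rather than re-tested.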

The following table provides some good examples of when to test NFRs.

| Validation Frequency | Example | Explanation |
| --- | --- | --- |
| Only Once | Application and access control – “Backups of the solution must maintain the same security as the Solution.” | Requires validation, perhaps as part of initial Service / Operator Acceptance, but is unlikely to need re-validating. |
| Periodic | Disaster Recovery – “There must be a failback process defined to revert operations back to the primary site.” | It is important to test this process periodically. |
| Before relevant changes | Transaction Performance – “The <name> batch component must process <number> <items> within <time> hours.” | Should be tested in the SDLC. |
| Continuous (these make good SLIs and SLOs!) | Transaction Performance – “Online performance of new <txn> must complete within <time> seconds in <percent>% of cases, and <time> in 100% of cases.” | Can be continuously tested.  Perfect candidate for translating into two Service Level Indicators, each with a Service Level Objective (one SLI for each completion time and percentage combination) – see the sketch below the table. |
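
To show how the last row translates into SLIs, here is a minimal sketch; the thresholds and the 95% target stand in for the <time> and <percent> placeholders, and the latency samples are invented:

```python
# Sketch: turning the latency NFR from the table into two SLIs with SLOs.
# Thresholds and the 95% target stand in for <time> and <percent>; latencies are invented.

latencies_seconds = [0.8, 1.1, 0.9, 2.4, 1.0, 0.7, 3.9, 1.2]  # sampled production latencies

def fraction_within(latencies, threshold_seconds):
    """SLI: fraction of transactions completing within the threshold."""
    return sum(1 for latency in latencies if latency <= threshold_seconds) / len(latencies)

# SLO 1: <percent>% of transactions complete within <time> seconds (placeholders: 95% within 2s).
slo_1_met = fraction_within(latencies_seconds, 2.0) >= 0.95
# SLO 2: 100% of transactions complete within the longer bound (placeholder: 5s).
slo_2_met = fraction_within(latencies_seconds, 5.0) >= 1.0

print(f"SLO 1 met: {slo_1_met}, SLO 2 met: {slo_2_met}")
```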


Some Non-Functional Requirements can be validated in production without requiring any synthetic load to be applied to the system, for example:

Performance must not degrade below NFR Performance Targets over an <8 hour> window when system is executing against typical daily load.

These are perfect candidates for translating into SLIs with SLOs.

In addition to this, some NFRs that do require generating synthetic load in production may also be useful, albeit the idea may sound strange at first.  Rather than waiting until your production system is under real heavy load to see whether it copes, you could apply heavy load in production in a more controlled way during times of less critical business consequences.  This is essentially a variant of Chaos Engineering.  Obviously it is essential that synthetic load doesn’t lead to physical or business side effects (e.g. you don’t want your synthetic load to actually order products in your ecommerce platform!).  Essentially this relates to maximising the benefit of the Wisdom of Production.
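
As a rough sketch of the idea (everything here – the endpoint, the header name, the rates – is a hypothetical example, not a real API or a recommendation): synthetic requests are explicitly tagged so that downstream systems can recognise them and short-circuit real-world side effects such as placing orders.

```python
# Hypothetical sketch: controlled synthetic load with an explicit marker so that
# downstream systems can short-circuit real side effects (e.g. never place a real order).
import time
import urllib.request

TARGET_URL = "https://example.internal/checkout"     # hypothetical endpoint
SYNTHETIC_HEADERS = {"X-Synthetic-Traffic": "true"}   # hypothetical marker the backend must honour

def send_synthetic_request() -> int:
    """Send one tagged synthetic request and return the HTTP status code."""
    request = urllib.request.Request(TARGET_URL, headers=SYNTHETIC_HEADERS)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status

def run_load(requests_per_second: int, duration_seconds: int) -> None:
    """Apply a modest, controlled load during a low-risk window."""
    for _ in range(duration_seconds):
        for _ in range(requests_per_second):
            send_synthetic_request()
        time.sleep(1)

# Example invocation, off-peak only:
# run_load(requests_per_second=5, duration_seconds=60)
```

The important design choice is the explicit marker: the backend has to be built to honour it, otherwise controlled synthetic load in production is not safe.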

To further illustrate my general point, the following table describes some deliberately bad examples of attempting to validate NFRs at inappropriate frequencies.

| Validation Frequency | Bad Example (don't do this) | Explanation |
| --- | --- | --- |
| Only Once | Capacity – concurrent execution – “Servers must not exceed an average of <75%> CPU utilisation under peak load over a <X minute> duration.” | If this is happening in production we want to know about it.  NB it isn’t necessarily good as an SLI, though, because it doesn’t directly impact users. |
| Periodic | Security – “Component failure must not compromise the security features of the Solution.” | This doesn’t change unless the system architecture changes, so there is no point validating it at arbitrary times. |
| Before relevant changes | “The batch system must not exceed the batch-processing window available when executed with production data volumes.” | Not a bad thing to test before go-live, but it would be a missed opportunity not to test it continuously. |
| Continuous | Internal Compatibility – “The new Solution should be based on tried and tested technology.” | This doesn’t change dynamically and cannot be automatically validated, so it will not change unnoticed and it is expensive (human-effort intensive) to check. |
| Continuous | “The online solution must support <NUMBER> concurrent users in a peak hour/day.” | If we aren’t getting <NUMBER> concurrent users we may not want to generate them artificially, as there is a cost associated with doing so. |

Please let me know if you have different opinions on this.  I’d love to learn from you.