What they actually did in the Phoenix Project and will it work for you?

I realised it had been over 6 years since I’d read The Phoenix Project book and decided it deserved a re-read.  I knew I would enjoy it and if nothing else, it would help me understand what people are actually talking about when they reference character names I’ve long forgotten.  Ok I hadn’t forgotten Brent… who could?

Knowing how incredibly popular and influential the book turned out to be made me think of it in a slightly different light.  I tried to empathise with some of the different people I pictured reading the book over the years, especially those for whom it was the first introduction to DevOps.  Inspired by this I decided to try something out: every time someone in the book did something new, I wrote it down.  It’s that list that I thought I would share in this blog.  Later I might opine on it, but for now it’s up to you to think about what will work / is working for you (and please do share your experiences in the comments!).

Here they are in chronological order:

  1. Made change requests visible
  2. Made planned versus requested work visible
  3. Asked for more team members  (didn’t work)
  4. Documented timelines when understanding P1 incidents
  5. Performed practice incident calls
  6. Made the change process physical (cards) and created the right attention filters and pre-work to enable fast reviews
  7. Categorised work – internal, project, change…
  8. Protected the bottleneck (person called Brent)
  9. Captured how to replicate the work done by the bottleneck
  10. Identify the constraint and exploit it i.e. Have it work on only the most important thing
  11. Ran an exercise with the managers that needed to collaborate better where they all shared personal stories that forced them to be vulnerable in front of each other
  12. Held a session to try to align to a common vision and be honest about as-is
  13. Looked for four types of work: business projects, IT Operations projects, changes, and unplanned work
  14. Projects freeze except Phoenix (the top priority one) – single track focus. Then only release work that can’t distract the constraint. Those pieces of work are identified by creating a bill of resources for a task
  15. Make work that improves daily work more important than doing daily work! Kata
  16. Work out what to do with: “These ‘security’ projects decrease your project throughput, which is the constraint for the entire business. And swamp the most constrained resource in your organization. And they don’t do squat for scalability, availability, survivability, sustainability, security, supportability, or the defensibility of the organization.”
  17. Pass a compliance audit using business processes to backstop IT
  18. If it’s not on the Kanban it won’t get done
  19. Took regularly occurring tasks and worked out how long they take.
  20. Used a FIFO queue to predict completion times.
  21. Communicated predicted completion times to end users
  22. Use mapping out / visualising work to figure out ways to improve it
  23. Created checklists, especially for use before handoffs, and ran 2 week PDCA cycles (kata)
  24. Colour coded cards when you are aiming for a particular ratio
  25. Categorised work: uses constraint, increases constraint throughput, other
  26. When it all started to go wrong (a task was underestimated and also involved handoffs), they broke out Little’s Law
  27. Created top 20 recurring tasks to improve upon
  28. Had a minute by minute expediter of important work
  29. Looked at business objectives and (talking to the business owners) align IT and what it can prioritise to that. Found it in the value chains. Then look at how it can jeopardize those – for the CFO?
  30. Talked to all the business leads about how IT affects them.
  31. Mapped the business objectives to the IT systems and then design ways to protect those IT systems
  32. Got the COO to accept the impact of IT on business metrics
  33. Did a business IT alignment activity with the business and use it to do risk management
  34. Did a retrospective on why a recent big program was flawed
  35. Did a complete re-scoping of security efforts
  36. Moved to smaller batch (release) sizes
  37. Towards small batch sizes – aimed for 10 /day
  38. Did a value stream mapping and marked problem areas
  39. Created an end to end dedicated team with dispensation, and automated env creation for Dev, QA, prod
  40. Did cloud adoption in 1 week(!)
  41. Created ability to re-create prod at touch of a button(!)
  42. In-sourced something now strategic to the business.
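Items 19 to 21 and 26 in the list can be illustrated with a little arithmetic.  The sketch below is only my own illustration, not something taken from the book: it assumes a single worker serving a strictly first-in, first-out queue, and it uses the textbook form of Little’s Law (average lead time = average work in progress / average throughput).  The task names, durations, and the `Task` class are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    estimated_hours: float  # measured duration of this recurring task (hypothetical)


def predict_completion_hours(queue: list[Task]) -> dict[str, float]:
    """Hours from now until each task finishes, assuming one worker and strict FIFO order."""
    elapsed = 0.0
    predictions = {}
    for task in queue:
        elapsed += task.estimated_hours
        predictions[task.name] = elapsed
    return predictions


def littles_law_lead_time(work_in_progress: float, throughput_per_day: float) -> float:
    """Little's Law: average lead time (days) = average WIP / average throughput per day."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return work_in_progress / throughput_per_day


# Hypothetical queue of recurring requests, oldest first.
queue = [Task("laptop rebuild", 2.0), Task("password reset", 0.5), Task("new VM", 4.0)]
predictions = predict_completion_hours(queue)

# Hypothetical figures: 30 items in progress, 5 items completed per day.
lead_time_days = littles_law_lead_time(work_in_progress=30, throughput_per_day=5)
```

The predictions are what you would communicate to end users (item 21).  The Little’s Law figure shows why piling more work in progress onto a constraint stretches lead times for everything in the queue.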

Hope this was helpful.  Shall I do it for The Unicorn Project?


Does your Class DevOps implement Interface DevOps?

I always worry a bit when I hear people repeat Google’s famous statement that “class SRE implements DevOps”.  It’s not that I think it’s wrong.  I certainly agree that the way Google describe SRE makes it sound very much aligned to the objectives of DevOps.  What I worry about is that (as I’ve witnessed) sometimes people misinterpret this to mean they are identical.  The danger of doing this is that people and organisations may close themselves off to the transformative potential of applying the valuable processes and approaches emphasised in SRE.

I’ve seen a lot of “DevOps” initiatives very busy with the Dev side of DevOps i.e. CI/CD pipelines and environment automation, but often with much less attention on Operational concerns i.e. Operability, Reliability, Resilience, Service Management, Observability.  I’ve seen SRE implementations be very effective at specifically addressing these residual problems.

I think it is true that the SRE body of knowledge contains a lot of valuable advice that is very relevant to implementing DevOps.  I agree that fundamentally they are trying to achieve exactly the same outcomes i.e. sustainably great IT that can enable great businesses.   I agree that for organisations heavily invested in implementing DevOps one way or other it would be very unhelpful to ignore that, duplicate that, or in any way deprecate or disparage that.  BUT there are also certain potential advantages to treating SRE differently within the context of a specific company/enterprise, such as:

  • It may create a renewed amount of focus on the Ops side of DevOps.
  • If starting in traditional Operations teams (by measuring SLIs, TOIL, etc.) it may be possible to deliver improvements (at least in terms of situational awareness) without having to impact inflight DevOps work (which may be CI/CD focused).
  • SRE practices are agnostic of tools, technology, Continuous Delivery maturity etc., so this may be an opportunity to affect those systems that may so far not have been the focus of your DevOps efforts.  For example a lot of DevOps attention has been on custom code and Digital, whilst harder to reach systems (COTS products, Mainframe, Core mission critical platforms etc.) may still be waiting in line.
  • If the DevOps initiatives in flight are currently heavily focused on Tools-as-a-Service, SRE may present an opportunity to take a fresh and complementary approach.

So overall yes, understood correctly Google’s famous statement is true.  But if your DevOps instantiation does happen to be a little on the side of big DEV, small ops, and you aren’t quite yet continuously deploying unicorns… I recommend putting some fresh and possibly even separate energy, highly focused on operations, reliability, and technical debt, into trying out more ideas from SRE.

 

Top 5 Misconceptions about SRE

I am very excited about Site Reliability Engineering (SRE) and continue to see more and more positive outcomes from people trying it out.  Like any popular term (especially in IT) the natural diversification of what SRE means is already well underway.  In this blog I wanted to share my favourite things that I’ve heard said about SRE, but I consider to be wrong or unhelpful.

  1. “SRE is only for use in Cloud.”

No, I would argue that adoption can be started by anyone irrespective of their technology stack.  At some point cloud makes infrastructure easier to automate, but that may not even be an urgent priority as part of an SRE implementation.  If we consider the Theory of Constraints I think in many cases infrastructure isn’t the true bottleneck for improving IT services.

  2. “SRE is an extreme / more advanced version of DevOps for the super mature.”

No, adoption can be started by anyone independently or combined with DevOps.  Just make sure you are working on SRE with / within the Operations team / function.

  3. “SRE is a special kind of platform development team.”

No, for me this is mixing up concepts.  As I’ve said before, I believe strongly in treating the platform like a first class product.  Having done that you are still presented with the same choice as any product: who develops it, who operates it, and how?  I see SRE as just as applicable to helping with the Dev vs Ops tension on platform applications as on any other software component.

  4. “SRE is just for websites.”

No and it may be helpful to make the S stand for System or Service instead of Site.  (Can’t see it catching on though.)

  5. “SRE is only for Google / Tech Giants / Unicorns / Start ups / Thrill seekers.”

No, SRE adoption can be safe, gradual, and in some ways is easier for traditional enterprises than DevOps is.  Unlike DevOps, SRE provides alternative options to putting Developers in charge of production.  Plus you probably have all of the tools you need to get started already.

Hopefully this was helpful.  In my next blog I’m going to share a few reservations I have about the popular statement “class SRE implements DevOps”.

Minimum Viable Operations

Several years ago I started using the name Minimum Viable Operations to describe a concept I’d been tinkering with in relation to: Agile, DevOps, flow, service management and operational risk.  I define it as:

The minimum amount of processes, controls and security required to expose a new system or service to a target consumer whilst achieving the desired level of availability and risk.

These days I see Site Reliability Engineering (SRE) rapidly emerging as a set of processes that help achieve this.  Whilst chatting with a friend about it earlier today I realised I’d never published my blog on this, so here it is.

When you need Minimum Viable Operations

These days lots of organisations have successfully sped up the processes of rapidly innovating and developing new applications using modern lightweight architectures and cloud.  They are probably using an agile development methodology and hopefully some good Continuous Delivery perhaps even using something like the Accenture DevOps Platform (ADOP).  They may have achieved this by innovating on the edge i.e. creating an autonomous sub-organisation free from the trappings and the controls of the mainstream IT department.  I’ve worked with quite a few organisations who have done this in separate possibly more “trendy” offices known as Digital Hubs, Innovation Labs, Studios, etc.

But when the sprints are over, the show and tell is complete and the post-it notes start to drop off the wall… was it all worth it?

If hundreds or maybe even thousands of person hours rich with design thinking, new technologies, and automation culminate in an internal demo that doesn’t reach real end users, you haven’t become agile.  If you’ve accidentally diluted the term Minimum Viable Product (MVP) to mean Release 1 and that took 6 months to be completed, you haven’t become agile. 

Whilst prototypes do serve a purpose,

the highest value to an entrepreneurial organisation comes from proving or disproving whether target users are willing to use and keep using a product or service, and whether there are enough users to make it commercially viable.

These are what Eric Ries calls, in The Lean Startup, the Value Hypothesis and the Growth Hypothesis.  Testing them requires putting something “live” in front of real users ASAP and to do that successfully I think you need Minimum Viable Operations (oh why not, let’s give it a TLA: MVO).

Alternative Options to MVO

Modern standalone agile development organisations have four options:

  1. Don’t bother with any operations and just keep building prototypes and dazzling demos.  The downside is that at some point someone may realise you don’t make money and shut you down.
  2. Don’t bother with MVO but try to get your prototype turned into live systems via a project in the traditional IT department.  The downside is that time to market will be huge and the rate of change of your idea will hit a wall.  CR anyone?
  3. Don’t bother with MVO and just put prototypes live.  The downside here is potentially unbounded risk of reputation and damages through outages and security disasters.
  4. Implement MVO, get things in front of customers, get into a nice tight lean OODA loop / PDCA cycle perhaps even using the Cynefin framework, and overall create a learning environment capable of exponential growth.  The downside here is that it might feel a lot less safe than prototyping and you will have to learn to celebrate failure.

Obviously option 4 is my favourite.

Implementing MVO

Aside: for simplicity’s sake, I’m conveniently ignoring the vital topic of organisational governance and politics.  Clearly navigating this is of vital importance, but beyond what I want to get into here.

Implementing Minimum Viable Operations is essentially a risk management exercise with a lot in common with standard Operations Architecture (and nowadays SRE).  Everything just needs to be done much, much faster.  This means thinking very hard and upfront about the -ilities such as reliability, availability, scalability and security.  Firstly, it requires you to be much more realistic about what each of them actually needs to be:

  • Reliability – is 3 9’s really needed in order to get your Minimum Viable Product in front of customers and measure their reactions?  Possibly you don’t actually need to create a perfect user experience for everyone all the time and can evaluate your hypotheses based on the usage that you are able to achieve.  Obviously there is something of a reputational risk in putting out products with a level of service quality unbefitting of your brand.  But do you actually need to apply your brand to the product?  Or perhaps you can incentivise users to put up with unreliability via other rewards, for example by not charging.
  • Availability – we live in a 24 by 7 always on world, but at the start, do we need our MVP to be always online?  Perhaps if we strictly limit use by geography it becomes painless to have downtime windows.  Perhaps as an extension to the reliability argument above, taking the system down for maintenance, re-design, or even to discontinue it, is acceptable.
  • Scalability – perhaps if our service gets so popular that we cannot keep it live that isn’t such a bad thing, in fact it could be exactly the problem that we want to have.  It’s like the old “premature optimisation is the root of all evil” argument.  There are lots of ways to scale things when needed in the cloud.  Perhaps perfecting this isn’t a pre-requisite to go live.
  • Security – obviously this is a vital consideration.  But again this must be strictly “right sized”.  If a system doesn’t contain personal data or take payment information, you aren’t storing data and you aren’t overly concerned about availability, perhaps this doesn’t have to be quite the barrier to go live that you thought it might be.  Plus if you are practicing good DevSecOps in your development process, and using Nexus Lifecycle, you are off to a good start.

Secondly, unlike a traditional Operations Architecture scenario where SLAs, RPOs, etc. may have been rather arbitrarily defined (perhaps copied and pasted into a contract), when creating MVO for an MVP, you may actually have the opportunity to change the MVP design to make it easier to put live.  Does the MVP actually need to: be branded with the company name, take payment, require registration, etc.?  If none of that impacts your hypotheses, consider dropping them and going live sooner.

Finally, as a long time fan of PaaS, I also have to mention that organisations can make releasing MVPs much easier if they have invested in a re-usable hosting platform (which of course may benefit from MVO).
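To make the “is 3 9’s really needed” question a bit more tangible, here is a minimal sketch that converts an availability target into the downtime it permits.  The function name and the 30 day period are my own assumptions for illustration; use whatever measurement window suits your MVP.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in the period for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Compare a few candidate targets for an MVP (hypothetical choices).
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows roughly {minutes:.1f} minutes of downtime")
```

Over 30 days, three nines leaves roughly 43 minutes of downtime, whereas two nines leaves over 7 hours.  For an experiment aimed at testing a hypothesis, the cheaper target may be perfectly adequate.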

What can someone do with this

I’ve seen value in organisations creating a relatively light methodology for doing something similar to MVO.  Nowadays they might (perfectly reasonably) also call it implementing SRE.  It could include:

  1. Guidelines for making MVPs smaller and easier to put live and operate.
  2. Risk assessment tools to help right-size Operations.
  3. Risk profiles for different types of application.
  4. Peer review checklists to validate someone else’s minimum viable operations architecture.
  5. Reusable platform (PaaS) services with opinions that natively provide certain safety as well as freedom.
  6. Re-usable security services.
  7. Re-usable monitoring solutions.
  8. Training around embracing learning from failure and creating good customer experiments.
  9. A general increase in focus on and awareness of how quickly we are actually putting things live for clients.
  10. When and how to transition from MVO to a full blown SRE controlled service.

Final Comment

DevOps has alas in many cases become conflated with implementing Continuous Delivery into testing environments.  Sometimes the last mile into Production is still too long.  Call it MVO (probably don’t!), call it SRE (why not?), still call it DevOps, Agile, or Lean (why not?)… I recommend giving some thought to MVO.

SLIs and SLOs versus NFRs

In this blog I propose a working definition of the difference between Non Functional Requirements (NFRs), and Service Level Indicators and Objectives (SLIs and SLOs) from SRE.  I couldn’t find anything else written about this topic but I’ve found this definition useful so far in my own experiences.

If you want to first brush up on SLIs and SLOs (both SRE terminology) I suggest reading this.  Otherwise, let’s dive in.

“SLOs are a subset of NFRs.”

All SLOs should be recorded as NFRs, but only NFRs that can be continuously measured in production make suitable SLOs.

All Non-Functional Requirements (NFRs) must be testable but the frequency of testing is variable.  Some require validation just once, some can be periodically re-validated, some need to be performed before relevant changes, and some can be performed continuously in non-production and in production.

When NFRs can be evaluated continuously and it is safe to do so in production they can be adopted as Service Level Indicators (SLIs).  As SLIs they can be measured by Operations (or the SRE team) in Production, managed with Error Budgets, and be subjected to pre-defined Consequences for when Error Budgets are depleted and SLOs breached.
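As a rough illustration of how an SLI, an SLO, and an Error Budget relate, here is a minimal sketch.  It assumes the SLI is a simple ratio of good events to total events in a measurement window; the `ErrorBudget` class, the function name, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    sli: float          # measured fraction of good events in the window
    allowed_bad: float  # number of bad events the SLO tolerates in the window
    consumed: float     # fraction of the budget used so far (can exceed 1.0)
    depleted: bool      # True once the budget is fully used


def error_budget(slo_target: float, good_events: int, total_events: int) -> ErrorBudget:
    """Compare a measured SLI against an SLO target expressed as a fraction, e.g. 0.999."""
    if total_events <= 0:
        raise ValueError("need at least one event to compute an SLI")
    if not 0 < slo_target < 1:
        raise ValueError("SLO target must be strictly between 0 and 1")
    bad_events = total_events - good_events
    allowed_bad = (1 - slo_target) * total_events
    consumed = bad_events / allowed_bad
    return ErrorBudget(
        sli=good_events / total_events,
        allowed_bad=allowed_bad,
        consumed=consumed,
        depleted=consumed >= 1.0,
    )


# Hypothetical window: 1,000,000 requests against a 99.9% SLO.
healthy = error_budget(0.999, good_events=999_400, total_events=1_000_000)
breached = error_budget(0.999, good_events=998_500, total_events=1_000_000)
```

In the first case about 60% of the budget has been consumed.  In the second the budget is depleted, which is the point at which the pre-defined Consequences would apply.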

The following table provides some good examples of when to test NFRs.

| Validation Frequency | Example | Explanation |
|---|---|---|
| Only Once | Application and access control – Backups of the solution must maintain the same security as the Solution. | Requires validation, perhaps as part of initial Service / Operator Acceptance, but is unlikely to need re-validating. |
| Periodic | Disaster Recovery – There must be a failback process defined to revert operations back to the primary site. | It is important to periodically test this process. |
| Before relevant changes | Transaction Performance – The <name> batch component must process <number> <items> within <time> hours. | Should be tested in the SDLC. |
| Continuous (these make good SLIs and SLOs!) | Transaction Performance – Online performance of new <txn> must complete within <time> seconds in <percent>% of cases, and <time> in 100% of cases. | Can be continuously tested.  Perfect candidate for translating into two Service Level Indicators, each with a Service Level Objective (one SLI for each completion time and percentage combination). |
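The continuous Transaction Performance row can be sketched as two latency SLIs, one per completion time and percentage combination.  This is only an illustration under my own assumptions: the thresholds, targets, and sample latencies are made up, and a real implementation would read latencies from your monitoring system.

```python
def fraction_within(latencies_s: list[float], threshold_s: float) -> float:
    """SLI: fraction of transactions that completed within the threshold."""
    if not latencies_s:
        raise ValueError("no latency samples to evaluate")
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s)


# Hypothetical SLOs as (threshold in seconds, required fraction) pairs:
# 95% of transactions within 2 seconds, and 100% within 5 seconds.
objectives = [(2.0, 0.95), (5.0, 1.0)]

# Hypothetical latencies observed in production, in seconds.
samples = [0.4, 0.8, 1.1, 1.9, 2.5, 0.7, 1.2, 0.9, 1.0, 4.2]

# For each objective, record whether the measured SLI meets the SLO.
results = {
    threshold: fraction_within(samples, threshold) >= target
    for threshold, target in objectives
}
```

With these samples the 5 second objective is met but the 2 second one is not, so the two SLIs consume their Error Budgets independently.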

 

Some Non-Functional Requirements can be validated in production but without requiring any synthetic load to be applied to the system for example:

Performance must not degrade below NFR Performance Targets over an <8 hour> window when system is executing against typical daily load.

These are perfect candidates for translating into SLIs with SLOs.

In addition to this, validating some NFRs that do require generating synthetic load in production may also be useful (albeit the idea may sound strange at first).  Rather than waiting until your production system is under real heavy load to see that it copes, you could apply heavy load in production in a more controlled way during times of less critical business consequences.  This is essentially a variant of Chaos Engineering.  Obviously it is essential that synthetic load doesn’t lead to physical or business side effects (e.g. you don’t want your synthetic load to actually order products in your ecommerce platform!).  Essentially this relates to maximising the benefit of the Wisdom of Production.

To further illustrate my general point, the following table describes some deliberately bad examples of attempting to validate NFRs at inappropriate frequencies.

| Validation Frequency | Bad Example (don’t do this) | Explanation |
|---|---|---|
| Only Once | Capacity – concurrent execution – servers not to exceed an average of <75%> CPU utilisation under peak load over a <X minute> duration. | If this is happening in production we want to know about it.  NB. It isn’t necessarily good as an SLI though because it doesn’t directly impact users. |
| Periodic | Security – Component failure must not compromise the security features of the Solution. | This doesn’t change unless the system architecture changes.  So there is no point validating it at arbitrary times. |
| Before relevant changes | The batch system must not exceed the batch-processing window available when executed with production data volumes. | Not a bad thing testing this before go live, but it would be a missed opportunity not testing it continuously. |
| Continuous | Internal Compatibility – the new Solution should be based on tried and tested technology. | This doesn’t change dynamically and cannot be automatically validated.  So it is not going to change unconsciously and is expensive (human effort intensive) to check. |
| Continuous | Online solution must support <NUMBER> concurrent users in a peak hour/day. | If we aren’t getting <NUMBER> concurrent users in a system we may not want to generate them artificially as there will be a cost associated with doing that. |

Please let me know if you have different opinions on this.  I’d love to learn from you.

Transforming the future together

We have chosen “Transforming the future together” as the theme for the DevOpsDays London 2017 Conference.  I wanted to share here some personal thoughts about what it means.  These may not be the same views of the other organisers (including Bob Walker who proposed the theme).  If you are reading this before 31st May 2017 there is still time for you to submit a proposal to talk!

One of the things I find most inspiring about the DevOps community is the level of sharing, not just in terms of Open Source software, but in terms of strong views on culture, processes, metrics and usage of technology.  I am especially excited by the theme because of the potential for expanding what we share.


If we are going to transform the future I think to an extent we have to hold an interpretation of:

  • where we are today,
  • where the future might be heading without our intervention,
  • how we would like it to be, and
  • what we might do to influence it.

Even standing alone I can imagine each of these having lots of potential.

The great thing about hearing about the experiences of others is that it can be a shortcut to discovering things without having to try them yourself.  For example you are walking in the country and you pass someone covered in mud.  “The middle of this track is much more slippery than it looks”, they volunteer with a pained smile.  Whether you choose to take the side of the path or test your own ability to brave the slippery part, you are much better off for the experience.

I hope to see some talks at DevOpsDays London that attempt to describe with great balance generally where they are today as technology organisations.   Not just how many times a day they can change static wording on their website, but also how fast they change for example their Oracle packaged products and what their bottlenecks are.  

“The future is already here — it’s just not very evenly distributed.”  William Gibson

Naturally everyone is at a different stage of their DevOps journey.  I’m hoping there will be a good range of talks reflecting this variety.

I would love to hear some predictions, optimistic and pessimistic, about where we might be and where we should be heading.  How will the world change when we cross 50% of people being online?  Are robots really coming for our jobs or our lives? Is blockchain really going to find widespread usage?  Will the private data centre be marginalised as much as private electricity generators have been?  Will Platform-as-a-Service finally gain widespread traction, or will it be eclipsed by ServerLess?  Will 100% remote working liberate more of us from living near expensive cities?  Will the debate about what is and isn’t Agile ever end?  What is coming after Micro Services?  Will the gender balance in our industry become more equal?  Can holacracies succeed as a form of organisation?  These questions are of course just the start.

Finally, what are people doing now towards achieving some of the above?  What technologies are really starting to change the game?  What new ways of working and techniques are really making an impact (positive or negative)?  How are they working to be more inclusive and improve diversity?  Are they actually liking bi-modal IT?  What are they doing to alleviate their bottlenecks?

Please share your own questions that you would like to see answered and once again please consider submitting a proposal.

Image: https://en.wikipedia.org/wiki/File:ArcaBoard_Dumitru_Popescu.jpg

How Psychologically Safe does your team feel?

As per this article, Google conducted a two-year long study into what makes their best teams great and found psychological safety to be the most important factor.

As per Wikipedia, psychological safety can be defined as:

“feeling able to show and employ one’s self without fear of negative consequences of self-image, status or career”

It certainly seems logical to me that creating a safe working environment where people are free to share their individual opinions and avoid group think, is highly important.

So the key question is how can you foster psychological safety?

Some of the best advice I’ve read was from this blog by Steven M Smith.  He suggests performing a paper-based voting exercise to measure safety.

Whilst we’ve found this to be a good technique, the act of screwing up bits of paper is tedious and hard to do remotely.  Hence we’ve created an online tool:

https://safetychecker.herokuapp.com/

Please give it a go and share your experiences!