Finding your path with Site Reliability Engineering (SRE)

In this blog I am excited to share a simple decision tree tool that I’ve developed with teams who are getting started with Site Reliability Engineering (SRE). I think it’s worth first understanding a bit of context, but if you prefer, feel free to dive right in here.

SRE is a set of principles and practices that can help organisations with the perpetual challenge of balancing changes to IT systems with the reliability, resilience, and operability of the corresponding production services.  The concepts build upon things that we have believed and practiced for a long time, but over the last two years many of our teams have taken fresh inspiration from adapting and applying ideas from SRE.

The SRE body of knowledge is broad and can support teams in considering topics as diverse as culture, major incident management, team topologies, observability, and architecture.  SRE can be useful as a team construct, a role, or simply a set of practices that any team can adopt.  Some teams find this diversity of content to be accessible and just what they need.  A lot of teams, however, just want to know:

“How do we actually get started with SRE, and what should we do today?” (many people)

In the past I tried distilling the bits I’ve found most differentiating about SRE into this self test script. I advocate the use of SRE to improve measurement and feedback loops for two important priorities:

  • Quality of service – within the bounds of the functional capability, is the service that the users/customers receive good enough?
  • Service Operability – are we happy with the impact that supporting this service has on our organisation and our colleagues?
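To make these two priorities measurable, here is a minimal sketch (the event counts and the 99.5% target are invented assumptions, not from any SRE standard) of how an SRE-style Service Level Indicator and error budget could be calculated:

```python
# Sketch of an SRE-style SLI/SLO calculation.
# The event counts and the 99.5% SLO target are illustrative assumptions.

def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of interactions that met the quality bar."""
    return good_events / total_events

def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget left (1.0 = untouched, < 0 = blown)."""
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure

sli = availability_sli(good_events=997_000, total_events=1_000_000)
print(f"SLI: {sli:.1%}")  # SLI: 99.7%
print(f"error budget remaining: {error_budget_remaining(sli, slo=0.995):.0%}")
```

Whether the indicator measures availability, latency, or something operability-related (like toil hours per week), the same arithmetic gives teams a feedback loop on both priorities.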

The decision tree follows this format and provides step by step suggestions on what to do next within your context. I also briefly tackle the financial motivations.

I hope you are now intrigued enough to click here and discover whether the tool helps you with your SRE journey. Please let me know if you find it useful or, even better, send a pull request on GitHub!

Defining an enterprise PaaS strategy – Part 3 of 3: Re-use versus Coupling

In part 1 I introduced the topic of building a PaaS strategy that is effective for major enterprises.  In part 2 I made a strong argument to buy as a service rather than self-manage.  Here I will share my thoughts on balancing re-use and sharing of PaaS instances versus the potentially hidden onward impact on agility and safety.

Number of PaaS Instances and PaaS Fallibility

There is a very easy trap to fall into with PaaS which I admit to having been caught by in the past – believing that all of the clever resilience makes it overall infallible.  After all, we know PaaS’ are built to handle scheduling, re-scheduling, and scaling of compute workloads.  We know they are built without single points of failure in critical components.  We know they have redundancy in storage, probably at multiple logical and physical levels.  But we also mustn’t forget this: they are complex software and humans are involved.  For all the clever “magic” in a PaaS, they can have downtime, and that can be in a very ungraceful “nothing is working” and possibly even “a part of the PaaS you’ve never even heard of is complaining” manner.  If you have a PaaS instance, you have to expect that at some point it will be 100% down, taking every last micro-service with it.

The first implication of this unfortunate truth is that you need an approach to redundancy at the PaaS level.  If availability really matters, you are likely going to need more than one PaaS instance (and that is definitely not the same as one PaaS instance spread across multiple availability zones or even regions).  You are also going to need an approach to fast restoration of service.  For example, if you have 2 PaaS instances and one goes down, you need to be able to rebuild the failed instance before you lose the only one you have left.
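The arithmetic behind this is worth sketching (the 99.5% figure is invented, and the independence assumption is generous – shared dependencies like a cloud region or a provisioning pipeline can take out both instances at once):

```python
# Illustrative arithmetic only: the per-instance availability figure is an
# invented assumption, and this treats instance failures as independent,
# which shared dependencies (region, DNS, config pipeline) can violate.

def combined_availability(per_instance: float, instances: int) -> float:
    """Availability if the service survives while any one instance is up."""
    return 1.0 - (1.0 - per_instance) ** instances

print(f"{combined_availability(0.995, 1):.3%}")  # 99.500%
print(f"{combined_availability(0.995, 2):.4%}")  # 99.9975%
```

The corollary follows: the second instance only provides that redundancy while it exists, so rebuilding a failed instance quickly (a low time to restore) is what keeps this maths honest.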

As an aside, a quick thank you to Michael Nygard for his excellent explanations of coupling which really helped me understand the topic more clearly.

Failure-state coupling

In virtualisation, where you have a hypervisor turning a physical machine into multiple logical virtual machines, there is a well-established concept of a “noisy neighbour”.  This is where one virtual machine starts taking more than its fair, or perhaps even allocated, resources and starts starving and impacting other local virtual machines.  Perhaps the noisy neighbour VM has a memory leak and is becoming unresponsive.  There is a danger that this failure effectively jumps to other VMs through them slowing down, and before you know it, you have cascading failure.  I think of this as failure-state coupling.

There is exactly the same possible problem with a PaaS instance.  The implication of this is very simple: you have to think about which services you actually want sharing a PaaS instance.  For example, your enterprise has a mortgage advisor virtual agent application and it has a payments platform.  Do you ever want to be in the situation where your mortgage advisor agent takes down a shared PaaS and you lose your payments platform?  This concept is called isolation.  It’s well established in traditional data centres and usually achieved at multiple physical, network, and logical levels.  Don’t throw that wisdom out through PaaS exuberance.

So the question of how many PaaS instances has to be answered.  The decision needs to factor in both redundancy and isolation.  These are very likely to lead you to wanting a handful of PaaS instances or perhaps more (even before you factor in instances for hosting test environments).  It’s a shame that the vision of one highly leveraged and centralised strategic platform instance doesn’t survive reality, but run-time coupling can be too unpalatable.


So we’ve accepted that we need multiple PaaS instances.  At least we can have a wonderfully industrialised PaaS factory, centre of excellence, PaaS instance self-service, PaaS instance SRE team, right?  Unfortunately, despite the promise of meeting traditional values of industrialisation, re-use, and cost cutting, there is a downside which I like to think of as dev-time coupling.  If we create one enterprise team responsible for creating and operating standardised and hardened PaaS instances, we create a dependency upon them and therefore the possibility of queues and teams blocking other teams.  Even if the team is just responsible for taking PaaS and serverless services from the public cloud, testing them, and adding some standardised provisioning and configuration code (Terraform and Ansible for example), you have created team coupling.  You have created the very real chance that one part of your organisation desperately wants to use a new PaaS service from AWS and they can’t get it for 6 months because it isn’t high enough on the centralised PaaS factory’s agenda.  (Perhaps even when all the PaaS factory would be required to do is bless it and make some IAM changes.)

When creating a PaaS strategy you need to think hard about how much development efficiency benefit having all teams use consistent PaaS instances is really going to bring versus the potential business impact of lost autonomy and agility.  In my opinion you might as well face up to having multiple different PaaS solutions in different teams and give up trying to make one uber-consistent version everywhere.  But what about the wild west and sprawl, I hear you asking?  For me this is where we can turn to Conway’s Law (and hopefully the use we’ve already made of it).  If an organisation has split into business function aligned, more autonomous end-to-end groups, those boundaries work here.  Firstly, I think those domains should still be given agency to make decisions about their PaaS implementation (not to say they can’t use that to opt for re-use of other implementations if appropriate).  Secondly, they might be an appropriate boundary for internal consistency, i.e. they accept dev-time coupling within their own group.  When organisations have taken the steps of creating governance structures like this, it can be a mistake to undermine them through an inflicted PaaS implementation.

Operation-time coupling

Even if we reject the idea that it’s fine to have multiple PaaS implementations and we are resolute on everyone having a consistent version, it is inevitable that you will end up running almost as many different versions of your consistent PaaS implementation in production as you have PaaS instances.

Firstly, I need to deliver one final home truth about PaaS: not only are they hard to upgrade, but you also need to upgrade them very regularly.  And this doesn’t just mean nested internal software that you don’t notice; often these are breaking changes affecting the API and contract between the PaaS and your code.

The idea of developing a new version of your PaaS implementation (perhaps in response to one team wanting new features) and then rolling it out big bang to every instance is very naïve.  Think about upgrading just one instance that, say, runs 100 micro-services: each upgrade exercise has the possibility of impacting every single one, conceivably in different ways.  That is a lot of testing to co-ordinate and definitely isn’t something you want to make wider in scope than necessary.  Let’s say you have 5 PaaS instances across your enterprise for different business aligned groups; are you really going to update them all in the same day / week / even month?  You’ll be lucky not to have 5 different versions all running in production.  I am not defending avoiding upgrades and drifting towards running out of support cover and increasing security vulnerability, but it’s hard enough co-ordinating for 1 team let alone 5.


So it’s a wrap: what I think an enterprise PaaS strategy should do, where to get PaaS’ from, and then here in part three – how to consider the true cost of re-use versus coupling.

Defining an enterprise PaaS strategy – Part 2 of 3: As a Service

In part 1 I introduced the topic of building a PaaS strategy that is effective for major enterprises.  Here I will justify my recommendation for buying as a service over building and running PaaS’ yourself.

Why As-a-service?

A long-time mantra of the clouderati (and something we no doubt borrowed from another field) has been “don’t do undifferentiated heavy lifting”.  This means if something is hard and there are diminishing returns in terms of how well you can do it in comparison to your competitors, you should look to receive it as a service.  Obviously when procuring something as a service, the service needs to be at least as good as you can do it yourself.  By “good” I essentially mean price, quality, and overall effectiveness.

In IaaS this means: can you meet your own compute, network, and storage requirements better than a public cloud provider can?  “Better” means cost, variety, scalability, reliability, security, etc.  For PaaS, all of the above logic applies: does how well you can create a platform to run your software applications present any significant advantage over what your competitors could either do themselves or purchase as a service?

As of June 2019, AWS’s managed Kubernetes PaaS service is generally available, joining Azure and GCP (and many others) with competing offerings.  Whilst these offerings are still relatively new, they are improving daily.

Without a doubt, installing and running a PaaS application (be that OpenShift, Kubernetes, Cloud Foundry, etc.) is heavy lifting.  You are undertaking operation of something which is orders of magnitude more complicated than what it takes to run individual micro-services.  You are running a distributed and dynamic multi-tier piece of software that contains tens of components and millions of lines of code written in over 10 languages – oh, and it’s open source.

Can your PaaS’ be a source of business differentiation compared to PaaS instances you can get from a cloud provider?  I doubt it.

Can you grow skills in house to do this that can compete with cloud providers doing the same thing at hyper scale?  I’m afraid not.

No longer a greenfield?

But alas the world is not quite so simple and greenfield, and even with acceptance of my opinions above, a number of reasons compel organisations to run their own PaaS’.  First the ones I disagree with:

  1. Unwillingness to use public cloud at all.
  2. Unwillingness to use anything above IaaS from a public cloud provider for fear of:
    1. Lock-in
    2. Additional shared security responsibility
    3. Distrust about their ability to make it reliable.
  3. Unwillingness to use Open Source.

And ones that I accept as unfortunate but reasonable:

  1. They started before public PaaS’ (at least from their preferred cloud provider) were fully available.  Migrating is hard to create a business case for (even ignoring any sunk cost fallacies).
  2. They have a lot of infrastructure in a private data centre (on prem) and the benefits of repurposing that to host a PaaS outweigh the costs of implementing and running a PaaS on it.
  3. They have a large estate on prem, a small planned workload in the cloud and network traffic between the two is prohibitive in terms of cost, or latency (or perceived security).

So to summarise: your enterprise PaaS strategy should have a view on how much of your PaaS estate (in terms of your different instances, and how much of the stack each instance is composed of) you will buy as a service and how much you will create and run yourself (and how you might shift that balance towards buying over time).

In part 3 I will cover the trade-off between re-use and coupling.


Defining an enterprise PaaS strategy – Part 1 of 3: Introduction

Large enterprises need platform teams and platform applications (aka platform services or PaaS).  In this blog series I will explore how to formulate a PaaS strategy and the key decisions around: high level strategy, where to get your PaaS’ from, how many you need, and how to manage them.

Why PaaS?

It is widely believed that custom software is a powerful solution for underpinning differentiated business functions.  It is also a popular idea that building software as very small (“micro”), composable, network-invoked services leads to more agile, scalable, and reliable solutions.  This presents IT functions with a challenging requirement that may be relatively new to them: what is the most effective way to run all of this new dynamic and granular custom code?  Fortunately there is an aaS for that in Platform as a Service (PaaS).

As I’ve pointed out in the past, things called as-a-service are different things according to who is providing the service to whom.  I find it clearer to think of platforms as being a product.  These products are themselves a software application that shields you from the intricacies of infrastructure and presents you with a suitable place to deploy and operate your software services.  Done well, they promise a perfect balance of reliability, security, safety, usability, and re-use.

Traditional / large enterprises now need strategies for governing their use of such software hosting platforms and this is what I want to opine on.

High level strategy

I believe enterprises should:

  • Give up on the idea of one consistent PaaS implementation, let alone one instance.  Instead they need strategies to cope with heterogeneous PaaS solutions (both in terms of implementations and running instances) and to become comfortable and effective at owning them.
  • Buy PaaS instances as a service from existing cloud providers.
  • Practice lots of PaaS disaster recovery (DR) scenarios until each PaaS instance meets the required time to restore (TTR) targets.
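As a sketch of that last DR point (the instance names, the target, and the drill timings below are all placeholders I have invented), drill results can be tracked per instance against the TTR target:

```python
# Sketch: tracking disaster-recovery drill results per PaaS instance
# against a time-to-restore (TTR) target.  All names and numbers are
# placeholder assumptions.

TTR_TARGET_MINUTES = 60  # hypothetical restore target

drill_minutes = {
    "paas-payments": [95, 70, 55],  # restore times, most recent drill last
    "paas-channels": [50, 45, 40],
}

for instance, times in drill_minutes.items():
    latest = times[-1]
    trend = "improving" if latest < times[0] else "flat/worse"
    print(f"{instance}: latest={latest}m "
          f"target_met={latest <= TTR_TARGET_MINUTES} trend={trend}")
```

The point is simply that practising until the latest drill reliably beats the target turns "we have DR" from an assertion into a measurement.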

They need a strategy that empowers teams to use these amazing and now abundant technologies.  The number one priority needs to be safety followed closely by enablement of autonomy and agility.  Everything else needs to fall in line.

In part 2 I will talk about buying as a service versus self managed. Finally I’ll cover the trade off of re-use and sharing versus coupling in part 3.

The Importance of Developer Experience in the PaaS Age

Platforms make or break software applications (literally) and they can make or break whole organisations.  Platform teams can also be a highly successful team topology as I’ve covered here.

At this time of rapid advancement in public cloud platform services, anyone with a credit card has ready access to powerful capabilities.  This erodes the advantage that established organisations have over new market entrants.  Every day platform capabilities (like container scheduling, caches, GPU farms, etc.) become more commoditised.  Every day integrating these technologies and delivering them internally as effective platform services becomes more important.

I believe large enterprises need to pivot their existing platforms to meet modern standards and this doesn’t just mean technical capability, but also usability.  The term Developer Experience (DX) can be defined as:

the equivalent of User Experience when the primary user of the product is a developer. DX cares about the developer experience of using a product, its libs, SDKs, documentation, frameworks, open-source solutions, general tools, APIs, etc.

It is nearly 10 years old as a concept with its origins in tech companies looking to get third party developers to use their APIs.  Lots of things impact DX as described excellently here but it is the factors related to platforms (and what to do about them) that I want to focus on.

Key Factors Impacting DX when using a platform

Whilst developer experience around a platform is also a function of things like the platform architecture, the API, and documentation, the following factors are the ones I believe matter most.

Minimise Extraneous Cognitive Load  

Psychologist John Sweller created the concept of Cognitive Load, which characterises how much ‘thinking effort’ a particular ‘thing’ needs in order to be done effectively.  A good platform will enable developers to devote their brain energy to germane cognitive load, which is essentially the creative part of performing their work.  It will minimise extraneous cognitive load, which means effort spent thinking about things like how to get code compiled or deployed.  A platform with challenges in this area may place high demands of extraneous cognitive load, leading developers to waste more effort thinking about how to use the platform than about what their code should be doing on it.

Maximise The Opportunity for Flow

Another important concept relevant to developer experience is Flow, as defined by psychologist Mihaly Csikszentmihalyi.  In a Flow state of working, people are completely absorbed in a task and hopefully both productive and at their happiest.  A good platform should enable a workflow that lets developers focus without being interrupted by their tools.  This means the time they spend waiting for computer tasks to complete – like committing code, compiling, or getting feedback from tests – needs to be absolutely minimised.  A platform with challenges may hold developers up so badly that it is impossible for them to work in a Flow-like state.

Quality of Platform Service

The performance and consistency of the platform, including tools and test environments, is of vital importance to developer effectiveness.  People very quickly become accustomed to a poor user experience, and at that point learned helplessness kicks in: they suffer in silence, and it becomes harder to understand how the platform may be negatively affecting their productivity and engagement.

Improving the platform’s impact on DX

Here are my thoughts on how to drive improvements.

Housing the challenge of improving DX

The challenge of creating effective DX involves factors beyond the scope of any one team.  Factors include the platform architecture and API, skills of the engineering team, other common platform services etc.  Naming a team after DX could be taken to imply that the team are solely responsible for it.  DX should be promoted as something achieved by multiple teams working both individually and together.  That said it can be effective to implement a team to own these complementary but different roles:

  1. Providing some tools as services to developers and as part of the overall platform API.
  2. Helping teams improve their delivery effectiveness through:
    • studying the DX of engineering teams, 
    • involving the right parties, 
    • driving experiments to learn how to improve it.

When focusing on this, it can be helpful to seek input from general (as in not just platform) UX specialists, and also (if you have them) any public API teams and DevRel people.

Studying Developer Experience

A collaborative approach to improving DX can benefit from:

  • Cross team terminology for describing it.
  • Common DX metrics for measuring it.
  • Joint initiatives for learning how to improve it.

This is of course reliant on a psychologically safe culture where teams are comfortable sharing their challenges and failures with each other.

It would be valuable to understand more about the current state of DX at an organisation.  This could include:

  • Tracking and sharing internally a baseline of above mentioned DX metrics.
  • Working closely with a rotating VIP engineering team.  (Has the side benefit that they can become advocates.)
  • Analysing Request, Incident and Problem Management of all tickets created by developers and related to the platform.
  • Reviewing metrics around the reliability of the tools (using Service Level Indicators from SRE).
  • Creating personas for engineering team members (considering things like the technologies they use, their proficiency, their team release cadences).
  • Gathering more user analytics directly from the tools e.g. compile times, invalid commands attempted by users.
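To illustrate the SLI idea applied to developer tooling (the build data and the 10-minute threshold here are assumptions of mine, not from any standard):

```python
# Sketch: a crude Service Level Indicator for a developer tool, e.g.
# "share of CI builds that succeeded within 10 minutes".
# The build data and the threshold are illustrative assumptions.

builds = [  # (duration_minutes, succeeded)
    (4, True), (7, True), (12, True), (6, False), (5, True),
]

FAST_MINUTES = 10

good = sum(1 for minutes, ok in builds if ok and minutes <= FAST_MINUTES)
sli = good / len(builds)
print(f"fast-and-successful build SLI: {sli:.0%}")  # 60%
```

Baselining even one or two indicators like this makes the "current state of DX" discussion concrete rather than anecdotal.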

Experimenting and Learning

With the above foundation of an improved collective understanding of DX, we can try to make changes to drive improvements.  Hopefully the insights will lead to some quick wins that can be implemented immediately.

An approach to more involved change could include:

  • Creating a value framework to help create a consensus about how we can measure and demonstrate improvement. 
  • Correlating our values to the DX metrics.
  • Ensuring we can baseline performance against the metrics.
  • Creating a hypothesis about how a change may improve one of the metrics we care about and hence deliver value.
  • Refining the hypothesis to create the lowest cost experiment that shows the fastest results.
  • Prioritising hypotheses using expected value divided by estimated effort.
  • Performing experiments and iteratively driving improvements.
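The prioritisation step above can be as simple as this sketch (the hypotheses and their 1-10 scores are invented placeholders):

```python
# Sketch: ranking improvement hypotheses by expected value per unit effort.
# The hypotheses and the scores are invented placeholders.

hypotheses = [
    {"name": "cache dependencies in CI", "value": 8, "effort": 2},
    {"name": "self-service test environments", "value": 9, "effort": 6},
    {"name": "faster local builds", "value": 5, "effort": 5},
]

# Highest value-per-effort first.
ranked = sorted(hypotheses, key=lambda h: h["value"] / h["effort"], reverse=True)

for h in ranked:
    print(f'{h["name"]}: score={h["value"] / h["effort"]:.1f}')
```

Crude as it is, a shared scoring rule like this makes the prioritisation debate about the inputs (value and effort estimates) rather than about opinions.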


I hope this has got you thinking about the platform parts of DX.  I would love to hear your own ideas about how to improve them.

Team Topologies Book Summary – Part 3 of 3: Taking Action

In part 1 of this 3 part blog series about the Team Topologies book, I summarised a large set of general things that can help make teams successful. In part 2 I covered the book’s ideas on team design and team interaction modes.  What an epic so far, and this has just been my summary of parts of the book with much less detail and specific advice.  In this final blog of the series I will leave you with some things to consider doing with all of this excellent information.

From the general guidance for effective teams, consider:

  • Are teams too large?
  • Do they have the right working environment?
  • Do they treat other team members well enough?
  • Are teams being allowed to stay together long enough?

Using the concept of Cognitive load, consider:

  • Do teams have the right skills to handle the intrinsic cognitive load expected of them?
  • Do teams use enough automation to minimise extraneous cognitive load?
  • Do teams have effective interaction modes with other teams to minimise extraneous cognitive load?
  • Do teams have a manageable amount of scope (germane cognitive load) or should they consider exploiting a fracture plane to divide up?
  • Could SRE processes be adopted to keep cognitive load down by continuing to divide it between a Dev and an Ops (SRE) team, but also to improve overall alignment, resilience, and agility?

The terminology for team types and interaction types in the book is extremely helpful when thinking about the teams in your organisation.

Using the topologies, consider:

  • Which types do our existing teams align to?
  • If our teams do align to a type, are they following the recipes for success in the book or falling into the anti-patterns?
  • If our teams do not align to a type in the book, should they be altered so that they do?  Or do we really want to customise the types for our needs?
  • Are our teams effectively using the interaction types recommended in the book, or if not, are they communicating too much, and generating too much cognitive load?

If existing team structures do not fit the advice of this book and you see room for improvement, what factors have influenced them to become as they are today (e.g. budgeting / finance models)?  Are these things easily surmountable?

I hope this has all been useful and you are now inspired to read the full book, to get much deeper into this vital topic, and to start applying it to your own organisations!

I think the authors Matthew and Manuel did an awesome job and would like to thank them again for writing it.

Team Topologies Book Summary – Part 2 of 3: Topologies and Interaction Modes

In part 1 of this 3 part blog series about the Team Topologies book, I summarised a large set of general things that can help make teams successful.  In this part I will get to actual team design, i.e. types of team and how they should interact.

The Four Team Types

Based on their extensive research, the authors present 4 types of team and make an important assertion: to be an effective organisation you only need a mixture of teams conforming to these 4 types.

Personally, I really identified with the 4 types, but even if you have different opinions, it is still very useful to be able to reference and extend / update the common set of types described (in such great detail) in the book.

The 4 types are as follows.

Stream Aligned Teams

This is the name the authors coined for an end-to-end team that owns the full value stream for delivering changes to software products or services AND operating them.  This is the stereotypical modern and popular “you build it, you run it” team.

For Stream Aligned teams to be able to do their job, they need to be comprised of team members that collectively have all of the different skills they need e.g. business knowledge, architecture, development and testing skills, UX, security and compliance knowledge, monitoring and metrics skills etc.  This enables them to play nicely with the requirement of minimising queues and handoffs to other teams (especially in comparison to teams comprised of people performing singular functions e.g. testing).

The expectation is that by teams owning their whole value stream including the performance of the system in production, they can optimise for rapid small batch size changes, and reap all of the expected benefits around both agility and safety.  The hope is also that this end to end scope might help teams achieve “autonomy, mastery and purpose” (things which Daniel Pink highlights as most important to knowledge workers).

The amount of cognitive load required of them is a factor of a few things:

  • If they have enough of the requisite skills in their team, intrinsic cognitive load will be manageable.
  • If the modes of interaction with other teams are clean and efficient, their extraneous cognitive load shouldn’t be too high.  They can also minimise this within their team through automation.
  • If they were formed using fracture planes effectively enough that they can keep their scope (and therefore their germane cognitive load) to a manageable amount.

Overall this seems like a very compelling team type and indeed the book recommends that the majority of teams in organisations have this form.

The book goes into a lot more detail about how to make a Stream Aligned Team effective which I highly recommend reading.

If the book only included Stream Aligned Teams, I would have considered it a cynical attempt to document fashionable ideas and ignore the factors that have led many people to have other team types.  Fortunately, instead they did a great job at considering the whole picture via 3 other types (plus bonus type SRE – read on!)

Platform Teams

In a tech start-up that begins with just one Stream Aligned Team owning the full stack and software lifecycle end-to-end, at some point they will face into Dunbar’s Number and cognitive overload.  It will be time to look for a fracture plane to enable splitting into at least two teams.  Separating the platform from the application is a very successful and well-established pattern.  At this point we could potentially have one Stream Aligned Team working on the application and one on the platform.

Let’s say demand for application complexity continues to increase and the application team splits into two teams focussed on different application value streams.  We’re now in a situation where they both most likely re-use the same platform team and we get the benefit of that re-use.  This could of course repeat many times.  The authors recognised that the nature of running a platform team (especially in terms of the type of coupling to other teams and effective modes of communication) differed enough to warrant a new team type, and this they called the Platform Team.

Platform Teams create and operate something re-usable for meeting one or more application hosting requirements.  These could be running platform applications like Kubernetes, a Content Management System as a service, IaaS, or even teams wrapping third party as-a-service services like a Relational Database Service.  They may also be layering on additional features valuable to the consumers of the platform such as security and compliance scanning.

But there are potential traps with platform teams.  If the platform is too opinionated and, worse, its usage is also mandated within an organisation, the impedance mismatch may do more harm than good for consuming applications.  If a platform isn’t observable and applications are deployed into a black box, the application teams will be disconnected from the detail of how their application is performing in production and disempowered to make a difference to it.  The book goes into a lot of detail about how to avoid creating bad platform teams and strongly makes an argument for keeping platforms as thin as possible (which it calls the minimum viable platform).

Personally, I think the dynamic that occurs when consuming a platform from a third party is very powerful for a number of reasons:

  • The platform provider is under commercial pressure to create a platform good enough that consumers pay to use it.
  • The platform provider has to be able to deliver the service at a cost point below the total revenue its users will give it (so it will be efficient and choose its offered services carefully).

If organisations can at least try to think like that (with or without internal use of “wooden dollars”) I think they will create an effective dynamic.  The book mentions that Don Reinertsen recommends internal pricing as a mechanism for avoiding platform consumers demanding things they don’t need.

Enabling Teams

The next type of team is designed to serve other teams to help them improve.  I think it is great that the book acknowledges and explores these as they are very common and often a good thing.  Enabling teams should contain experienced people excited about finding ways to help other teams improve.  To some extent they can be thought of as technical consultants with a remit for helping drive improvement.

Most important is how well Enabling teams engage with other teams.  There are various traps such a team can fall into and must avoid:

  • Becoming an ivory tower defining processes, policies, perhaps even technical ‘decisions’ and inflicting them upon the teams that they are supposed to be helping.
  • Generally being disruptive by causing things like interruptions, context switching, cognitive load, and communication overhead – especially if this cost outweighs the benefit.

As with all other types, the book provides detailed expected behaviours and success criteria for enabling teams, such as understanding the needs of the teams they support and taking an experimental approach to meeting those needs.  It’s also important that enabling teams bring in a wider perspective, for example awareness of technology advances and industry technology and process trends.  Enabling teams may be long lived, but their engagement with the teams they are supporting should probably not be permanent.

Complicated-Subsystem Teams

Finally, the book defines a 4th type of team called a complicated-subsystem team.  Essentially these teams own the development and maintenance of a complicated subcomponent that is probably consumed as code or a binary, rather than at runtime as a service over a network.  The concept is that the component requires specialist knowledge to build and change, and that this can be done most effectively by a dedicated team that doesn’t also carry the cognitive load of consuming and deploying the component.

Other types of teams: SRE teams

The book does acknowledge some other team types outside of the main 4, for example, SRE teams.

Without getting into too much detail, Site Reliability Engineering (SRE) is an approach to Operations that Google developed and started sharing with the world a couple of years ago.  The thing that is most interesting about it from a team design perspective is that an SRE team is, in effect, a traditional separate operations team.  In this regard it doesn’t really fit the 4 topologies above.  I think there are a couple of reasons the SRE model is successful:

  1. At least at Google, the default mode of operation is Stream Aligned teams.  An SRE team will only operate an application if it is demonstrably reliable and doesn’t require a lot of manual effort to operate.
  2. It promotes use of some conceptually simple but effective metrics to ensure the application meets the standards for an SRE to operate it.

The book says that if your systems are reliable and operable enough you can use SRE teams.  Doing so will of course reduce some cognitive load on the Stream Aligned team, which then has to pay less attention to the demands of an application in production.  This is actually very interesting because in some ways a Stream Aligned team handing over applications to an SRE team to operate is very similar to a traditional Dev and Ops split.

So the message, in a way, becomes: if your processes are mature enough and your engineering good enough, traditional ways will suffice; BUT the problem is they probably aren’t, and instead you need to use Stream Aligned teams until you get there.  I think this leads to an alternative option for organisations – directly implement the measurement of SRE and focus on quality until you can make your existing separate Dev and Ops teams perform well enough.  I even suggest this justifies a first-class fifth team type.
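To make the SRE handover idea concrete, here is a minimal sketch of gating the decision to let a separate SRE/Ops team operate a service on measured reliability and manual-effort data rather than organisational convention.  The function name, thresholds, and metrics below are hypothetical illustrations, not taken from the book or from Google’s practice:

```python
# Hypothetical gate for handing a service from a Stream Aligned team to a
# separate SRE/Ops team: the service must demonstrably meet its SLO and
# must not demand too much manual (toil) effort.  All thresholds are examples.

def ready_for_sre_handover(sli_achieved: float,
                           slo_target: float,
                           toil_hours_per_week: float,
                           max_toil_hours: float = 10.0) -> bool:
    """Return True only if the service meets its SLO and toil is bounded."""
    return sli_achieved >= slo_target and toil_hours_per_week <= max_toil_hours

# A service achieving 99.95% against a 99.9% SLO with 4h/week of toil
# qualifies; one missing its SLO does not, however little toil it generates.
print(ready_for_sre_handover(0.9995, 0.999, 4.0))   # True
print(ready_for_sre_handover(0.9980, 0.999, 1.0))   # False
```

The point of the sketch is that the gate is symmetric: if the service degrades, it fails the same check and operation hands back to the Stream Aligned team.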

Other types of teams: Service Experience

The book also acknowledges that many business models include call centre teams and service desk teams who engage with customers directly.  It proposes keeping these teams separate from the team types above, while ensuring information about the operation of the live system makes it back to the Stream Aligned teams.  It also talks about the concept of grouping and tightly aligning Service Experience Teams with Stream Aligned teams.  The Service Experience Teams are named this way to emphasise their customer orientation.

A key success criterion is the relationship to Stream Aligned teams, which must be close and ideally 1 to 1.  I think this is an excellent point: in many organisations, the 1-to-many (overly shared) relationships between teams that ought to be closely aligned cause as many problems as the team boundaries themselves.  For example, if an application support team is spread across too many applications, it may achieve some efficiencies of scale, but the overall service delivered to each supported team will be far less effective than if the team were subdivided into smaller, more closely aligned teams.

If you spotted other team types in the book that were acknowledged as acceptable – let me know.

The Three Modes of Team Interaction

The book defines 3 interaction modes and recommends which team types should use each one.  By interactions the book means how and when teams communicate and collaborate with each other.  Central to this is the idea that communication is expensive and erodes team structures, so it should be used wisely.

The first interaction mode the book defines is called Collaboration.  This is a multi-directional, regular, and close interaction.  It is fast, often real time, and therefore responsive and agile.  It is the mode that enables teams to work most closely together and is useful when there is a high degree of uncertainty or change and co-evolution is needed.

The cost of this mode is higher communication overhead and increased cognitive load – because teams will need to know more about the other teams.

Stream Aligned Teams will be the most likely to use this, probably with other Stream Aligned teams and especially earlier in the life of a particular product when uncertainty is highest, and boundaries are evolving.

The second mode is called X-as-a-Service.  A team wishing to use this interaction mode needs to consciously design and optimise it so that it serves both them and the other parties as effectively as possible.

Making this mode effective entails abstracting unnecessary detail from other teams and making the interface discoverable, self-documented and possibly an API.

The downside of this mode is that if it isn’t implemented effectively it may reduce the flow of other teams.  It also needs to stay relatively static in order to avoid consuming teams having to constantly relearn how to integrate.

As you’ve probably guessed this is a great model for a Platform Team to adopt – especially if they are highly re-used by many consumers.  It can be especially effective when platforms are well established and do not require rapid change or co-evolution of the API.

The final mode is called Facilitating.   This is the best mode for Enabling teams and describes how enabling teams can be helpful without being over demanding.

All of the above modes are described in much more detail than this, and the book offers very actionable advice.

Continuous Evolution

The book has a section on static topologies, i.e. how teams may interact at a point in time.  However, it stresses the importance of sensing and adapting.  As the maturity of a team changes, collaboration modes and even team types may change or subdivide.

The Team Topologies book uses SRE as a good example of teams moving from Stream Aligned, to a function containing reliability specialists (who may or may not call themselves SREs) operating as an enabling team, to an SRE team acting as a separate Ops team, and then back.  Obviously, there is a tension between making these changes to keep teams effective and the costs of changing teams and resetting back to Storming (as per Tuckman).

The book even offers advice for helping existing functional or traditional teams transition into the new model, for example: infrastructure teams to Platform teams, support teams to Stream Aligned teams.

In the final part of this series, I will share some thoughts about how to use this information.

Team Topologies Book Summary – Part 1 of 3: Key Concepts

Team Topologies is one of the latest books published by IT Revolution (the excellent company created by Gene Kim, co-author of The Phoenix Project and author of The Unicorn Project).  Team Topologies was written by Matthew Skelton and Manuel Pais (the people behind the DevOps Topologies website), but it is far more than an ‘extended dance remix’ of that.  The book takes a much wider view on the team designs that make companies successful and considers the full socio-technical problem space.

This blog series is a set of highlights from the book followed by some suggestions of what to do with the information.  I hope it will also inspire you to read the book and have some fresh thoughts about improving your organisation.

Conway’s Law demonstrates that team structures are incredibly important

Conway’s Law is the famous observation that the structure of teams (or more specifically the structure of information flow between groups of people) impacts the design of a system.  For example, if you take a 4-team organisation and ask them to build a compiler, you will most likely get a compiler that has 4 stages to it.  This is important because of the implication that the structure of Teams doesn’t just impact governance, efficiency, and agility, it also impacts the actual architecture of the products that get built.  The architecture of products is vitally important because of the heavy bearing it has on the agility and reliability of not just systems, but the businesses driven by them.

So Conway’s law teaches us two things:

  1. Team designs are very important because good team designs lead to good software design, and good software design leads to better, more effective teams (or, if you get it wrong, the cycle goes the other way).
  2. The factors that influence both a good architecture as well as good teams need to be considered when designing teams, team boundaries, and the planned communication required between teams.

General guidance for great teams

The book doesn’t overlook covering general factors that influence high performing people and teams.  For example:

  • Team sizing – Dunbar’s number is highlighted for its recommendation about the limits of how many people can successfully collaborate in different ways:
    • The strongest form of communication happens in an individual team working closely together with a very consistent shared context and specific purpose.  In software this number is held to be around 8 people.
    • The limit of people who can all deeply trust each other is 15, and there is also significance in the dynamics of 50 and 150 people.  This can provide useful constraints when thinking about the design of teams of teams.
  • The importance of having long-lived teams that are given enough time to reach a state of high performance and then stick together and capitalise on it.  Research shows it can take 3 months for a team to reach the point where it is many times more effective than the sum of its individual members.  The better the overall company culture, the easier changing teams can be, but even in the best cases the research recommends keeping teams together for no less than a year.  The book also highlights that the well-known Tuckman team performance model (Forming, Storming, Norming, Performing) has been shown to be less linear than it sounds.  The Storming stage has been found to restart every time there is a personnel or other major change to the team.
  • Some aspects of office design are important.
  • Putting other team members first and generally investing in relationships and making the team an inclusive place to work is vital.
  • Defining effective team scope, boundaries, and approaches to communication.  This is broadly the topic of the rest of the book (and this blog).

Team Design Consideration: Minimising Team Handoffs, Queues, and Communication Overhead

The book talks about the importance of organising for flow of change to software products.  In order to do that you need to consider team responsibilities and to minimise and optimise communication.  The book presents a case study from Amazon as exponents of “you build it, you run it” approaches.

You also need what the book calls “sensing” which is where an organisation possesses enough feedback mechanisms to ensure software and service quality is understood as early and clearly as possible.

Whilst teams may communicate and prioritise effectively within the boundary of their team, external communication across team boundaries is always much more costly and much less effective.  When teams have demands upon other teams, this can:

  • be disruptive and lead to context switching
  • lead to queues, delays, and prioritisation problems
  • create a large communication and management overhead.

I found the point about communication thought provoking, because it’s often a popular idea that the more collaboration within an organisation, the better.  As the book states, in practice all communication is expensive and should be considered in terms of cost versus benefit.  A friend of mine, Tom Burgess, pointed out a nice similarity to the memory hierarchy in Computer Science.  This also got me thinking about the parallels between people and the Fallacies of Distributed Computing!

Team Design Consideration: Cognitive load

The book introduces a very helpful concept from psychology (created by John Sweller) called Cognitive Load.  This describes how much ‘thinking effort’ a particular ‘thing’ needs in order to be done effectively.  It observes that not all types of thinking effort are the same in nature:

  • Different tasks require different types of thinking
  • There are different causes of the need to think
  • There are different strategies for managing and reducing the effort required for the different types of thinking.

The types are:

  • Intrinsic Cognitive Load – this relates to the skill of how to do a task e.g. which buttons to press according to the specific information available and scenario.
  • Extraneous Cognitive Load – this relates to broader environmental knowledge, not related to the specific skill of the task, but still necessary, e.g. what are the surrounding process admin steps that must be done after this task.
  • Germane Cognitive Load – this relates to thinking about how to make the work as effective as possible e.g. what should the design be, how will things integrate.

The strategies for managing each type of cognitive load are as follows:

  • Intrinsic Cognitive Load – can be reduced by training people or finding people experienced at a task.  The greater their relevant skill levels, the lower the load required to do the task.
  • Extraneous Cognitive Load – can be reduced by automation or good process design and discoverability.
  • Germane Cognitive Load – is really the type of work you want people focusing on.  However, the amount required is a factor of how much scope someone has to worry about.

This is all very interesting and applicable to individuals, but you might be wondering: how does this relate to organisational team design?  The book presents a very useful idea here: cognitive load should be considered in terms of the total amount required by whole teams.

Putting this in plainer terms, when you are thinking about team design and organisational structure, you need to consider how much collectively you are expecting the team to:

  • know
  • be able to perform
  • be able to make effective decisions about
  • be able to make brilliant.

The reason I found this so powerful is because it gives you a logical way to reason with the current fairly fashionable idea that end-to-end / DevOps / BusDevSecOps(!) teams are the utopia (or worse: the meme that if there are separate teams in the value stream you are not doing DevOps and you are doing it wrong).  Sure, giving as much ownership of a product or service as possible to one team avoids team boundaries, but it also increases the cognitive load on the team and potentially the minimum number of people needed in a team.

Decomposing and fracture planes

So a simplified summary so far:

  • Amazon have helped highlight the benefit of minimising handoffs, queues, and communication
  • Sweller has taught us to avoid giving a team too much Cognitive load
  • Dunbar has taught us to keep teams at around 8 people.

At this point we have to consider the amount of scope assigned to a team as our way to satisfy the above constraints.  If we can keep it small enough, perhaps we can still give them end-to-end autonomy whilst keeping team sizes down and cognitive load manageable.

This is where the book starts talking about the concept of Fracture Planes.  The name is a metaphor: when stonemasons break rocks, they focus on the natural characteristics of the structure of the rock in order to break it up cleanly and efficiently.  The theory is that software systems also naturally have characteristics that create more effective places to divide things up.  The metaphor is especially poetic considering large, tightly coupled software systems are often likened to monolithic rocks.

The book provides a useful discussion of different types of fracture plane to explore including:

  • Business domains (i.e. where domain-driven design techniques such as bounded contexts come into play)
  • Change cadence i.e. things that need to change at the same pace
  • Risk i.e. things with similar risk profiles
  • (Most important of all) separation of platform and application (more on that to come).

Ideally all of these can help decompose systems whilst minimising cognitive load and keeping team sizes small enough.

In Part 2 I’ll share the reusable patterns the book proposes for doing this.

SRE Certification – Free Self Test

In early 2017 Martin Fowler wrote a great article called Continuous Integration Certification.  The title was a joke parodying the professional certifications that grant you the title ‘Certified Master of Buzzword’ in exchange for paying for a Buzzword certification course and passing a multiple-choice test.  (Which reminds me, there is a whole site for one hilarious DevOps certification parody.)

Martin wasn’t of course actually selling a certification; he was just writing down self-evaluation criteria to help people assess whether they were actually performing Continuous Integration in the spirit in which the term was invented, or whether they were missing out by having effectively cargo-culted just the easiest-to-copy bits.  The criteria for Martin’s self-evaluation had been created and used by Jez Humble at conferences, where he politely helped large rooms of people realise they still had further opportunity for kata around their CI/CD pipelines.  It was an entertaining post that very effectively highlighted important parts of implementing Continuous Integration.

In this post I wanted to attempt to recreate the same blog but on the topic of Site Reliability Engineering (SRE).  So unfortunately I won’t be offering certificates (but feel free to make and print your own).  Ben Treynor Sloss at Google created the term SRE, and Google have documented it at length, including in two books.  I don’t claim to be an authority on the topic.  But I do have around 18 months of experience of experimenting with SRE practices in different settings, and hence I base my criteria here on what I have found to be most valuable.

  1. Do you have a common language to describe the reliability of your services, expressed in the eyes of your customer and written in business terms?  You might choose to base this around the terminology that Google defined (SLIs, SLOs).  It needs to be able to describe customer interactions with the system and to distinguish whether they are of sufficient quality to be considered reliable or not.  It should be understood consistently by all internal stakeholders across people performing the roles of traditional functional areas (Business, Dev, Ops).
  2. Have you used that common language to create common definitions of what reliability means for your specific systems and have you agreed across all internal roles (see above) what acceptable target levels should be?
  3. Are there people identified as being committed to trying to achieve the agreed ‘acceptable’ target levels of reliability by supporting the system when it is experiencing issues?  They may or may not have the job title Site Reliability Engineer.
  4. Have you implemented capturing these metrics from your live systems so that you can evaluate over different time windows whether the target levels of reliability were achieved?
  5. Are there people that review these metrics as part of supporting the live service and use them to prioritise their efforts around resolving incidents that are causing customers to experience events that do not meet the definition of being reliable?  They could be using Error Budget alerting policies for this, or alternative approaches like reporting on significant numbers of unreliable events.
  6. When one of the metrics is failing to hit the target value over the agreed measurement window, does this situation lead to consequences that address the root causes, improve reliability, and increase resilience?  Are those consequences felt across (i.e. will they be noticed by) all stakeholder functional areas (Business, Dev, Ops), i.e. they definitely aren’t just the job of the affected people performing support roles to resolve?
  7. Are the consequences in the above step pre-agreed, i.e. the breaching of the target doesn’t lead to a prioritisation exercise, or worse, something like a problem record being turned into a story at the bottom of a backlog?  Instead these consequences should happen naturally.
  8. Have you made a commitment to the people supporting a service that manual work, incident resolution, receiving alerts when on call, etc. will only represent part and not all of their job?  The rest of their time will be available for continuous improvement, personal improvement, and other overhead.  You don’t have to call this work Toil (as Google do), and you don’t have to target people spending below 50% of their time on Toil (as Google do).  You just need to set an expectation with some target and commit to hitting it.
  9. Do you have a well understood definition of what Google call Toil, and are you planning to measure and manage the amount of it that teams are performing?
  10. Do you have a mechanism for quantifying the Toil performed and using that information to prioritise the reduction of Toil?
  11. Do you have a mechanism that ensures team members have time to reduce Toil (i.e. pay back operational Technical Debt)?
  12. Do you continuously strive to create a psychologically safe environment and celebrate the learning you get from failures?

If you answered ‘yes’ to all of these I believe you’ve successfully taken on board some of the most valuable parts of SRE.  Print yourself a certificate!  Obviously if you answered “yes” to some and “we’re trying that” with others, then that’s fantastic as well.
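The measurement loop in items 4 to 6 is, at its core, simple arithmetic.  Here is a minimal sketch, with hypothetical event counts and a hypothetical 99.5% target, of how an SLI and the remaining error budget for one measurement window might be computed:

```python
# Minimal sketch of SLI / SLO / error-budget arithmetic for a single
# measurement window.  The event counts and the 99.5% target are
# hypothetical examples, not a real service's figures.

def sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that met the definition of 'reliable'."""
    return good_events / total_events

def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of this window's error budget still unspent (negative if breached)."""
    allowed_bad = total * (1 - slo_target)   # events we can 'afford' to fail
    actual_bad = total - good
    return (allowed_bad - actual_bad) / allowed_bad

# Example: 10,000 requests, 9,970 judged reliable, against a 99.5% SLO.
# The 50-event budget has 30 events spent, so 40% remains.
achieved = sli(9_970, 10_000)
budget = error_budget_remaining(9_970, 10_000, 0.995)
print(f"SLI={achieved:.3%}, error budget remaining={budget:.0%}")
```

A negative result from `error_budget_remaining` is the trigger for the pre-agreed consequences in items 6 and 7.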

You might be surprised to see the omission of things like "you build it you run it", "full stack engineers", "operators coding", even "cloud".  In my opinion these can be parts of the equation that work for some organisations, but they are orthogonal to whether you are practising SRE or not.  If you are surprised not to see things about emergency response processes, data processing pipelines, canary releases, etc., it’s not because I don’t think they are important; I just don’t see them (or the emphasis on them) as unique enough to SRE to be part of my certification.  (Perhaps I should create an advanced certification – cha-ching.)

Hope this is helpful.  Please let me know if you have ideas about the list.

Repeat after me “I am technical”

I’d say roughly once a week I hear work colleagues say the words “I am not technical”.  If you recognise this as something you say, I’m writing this blog to try to convince you to stop.

You are technical

The dictionary definition of technical is as follows:

technical /ˈtɛknɪk(ə)l/ adjective: 1. relating to a particular subject, art, or craft, or its techniques.

Did you spot the bit about coding skills?  No, because it has nothing to do with being technical!

Think about any piece of technology and it’s possible to consider levels of detail:

  1. Why is someone paying for it to exist?
  2. What does it do?
  3. What software is running and how does that work?
  4. What platform is the software running on and how does that work?
  5. What hardware is the software running on and how does that work?

You may be tempted to categorise some of these as functional as opposed to technical.  But as per Wikipedia, a functional requirement is still technical:

“Functional requirements may be calculations, technical details, data manipulation and processing and other specific functionality that define what a system is supposed to accomplish.”

Even if your primary focus is on the higher levels above, and you do feel compelled to draw a line on the list between functional and technical, you almost certainly know a LOT more than the average person in the world about the levels below wherever you place your line.  Your knowledge of those levels may be incomplete, but guess what – so is everyone’s.

We all have levels that we feel most comfortable working at.  I think it is vital to continuously learn about the other levels above and below.  Mentally labelling them as incompatible or off limits has no benefit.  No-one understands every detail all the way down or back up the stack.  At some point everyone has to base their understanding on an acceptance that the things that need to work just work.  Watch physicist Richard Feynman’s video about how (or not actually really how) magnets work for more on this point.

Once you get a degree, you think you know everything.
Once you get a masters, you realise you know nothing.
Once you get a Ph.D., you realise no one knows anything!  (Anon)

Why it can be harmful to say “I’m not technical”

A huge reason not to brand yourself ‘not technical’ is Stereotype Threat.  This is a phenomenon studied in social psychology which shows that people from a group about which a negative stereotype exists may experience anxiety about the stereotype that actually hinders their performance and makes it more likely the stereotype comes true.  Applied here: thinking that you are from a group (for example a role, a part of your company, an academic background, etc.) that is less technical may make it harder for you to become more technical.

Why it’s a waste to say “I’m not technical”

As an employee at your fine company, at any stage of your career, I think you are a role model to others.  You are a face that will be associated with all of the amazing technology and technical achievements we are responsible for delivering.  You are a face that will represent in people’s minds what people who work in ‘Tech’ look like.  This is an amazing opportunity and privilege, and something to be proud of.  I don’t think you should diminish it by claiming your level of contribution is not technical.  That completely wastes the opportunity for you to demonstrate what ‘Tech’ really is and make it appealing to others.

We all have a role to play in supporting each other in becoming more technical, and one of the simplest steps we can take is to be mindful of the words we use.  Many people (for example Dave Snowden here) have observed the positive and negative impacts of language in uniting an in-group and excluding an out-group.  We must play our part in minimising jargon and its unnecessary negative effects.  Just remember, when you ask someone what a particular piece of ‘technical’ terminology means, don’t expect them to necessarily have a complete understanding either.

Repeat after me:  “I am technical”.