SLIs and SLOs versus NFRs

In this post I propose a working definition of the difference between Non-Functional Requirements (NFRs) and Service Level Indicators and Objectives (SLIs and SLOs) from SRE. I couldn’t find anything else written about this topic, but I’ve found this definition useful so far in my own experience.

If you want to brush up on SLIs and SLOs (both SRE terminology) first, I suggest reading this. Otherwise, let’s dive in.

“SLOs are a subset of NFRs.”

All SLOs should be recorded as NFRs, but only NFRs that can be continuously measured in production make suitable SLOs.

All Non-Functional Requirements must be testable, but the frequency of testing varies. Some require validation just once, some can be periodically re-validated, some need to be validated before relevant changes, and some can be validated continuously, both in non-production and in production.

When NFRs can be evaluated continuously, and it is safe to do so in production, they can be adopted as Service Level Indicators (SLIs). As SLIs they can be measured by Operations (or the SRE team) in production, managed with Error Budgets, and subjected to pre-defined consequences when Error Budgets are depleted and SLOs are breached.
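To make the Error Budget idea a little more concrete, here is a minimal sketch in Python. The function name, window, and numbers are illustrative assumptions of mine, not the API of any particular monitoring tool:

```python
# Minimal sketch: comparing an availability SLI against an SLO target and its
# Error Budget. All names and numbers are illustrative, not a real tool's API.

def error_budget_status(good_events: int, total_events: int, slo_target: float):
    """Return the measured SLI, the fraction of Error Budget remaining,
    and whether the SLO is currently breached."""
    sli = good_events / total_events                  # e.g. successful requests / all requests
    allowed_bad = (1 - slo_target) * total_events     # the Error Budget, in events
    actual_bad = total_events - good_events
    budget_remaining = 1 - (actual_bad / allowed_bad) if allowed_bad else 0.0
    return sli, budget_remaining, sli < slo_target

# Illustration: a 99.9% availability SLO over a 30-day window of requests
sli, remaining, breached = error_budget_status(
    good_events=2_591_000, total_events=2_592_000, slo_target=0.999)
print(f"SLI={sli:.5f}, budget remaining={remaining:.0%}, SLO breached={breached}")
```

The pre-defined consequences (for example, pausing feature releases) would then hang off the `breached` flag and the remaining budget.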

The following table provides some good examples of when to test NFRs.

| Validation Frequency | Example | Explanation |
| --- | --- | --- |
| Only once | Application and access control: backups of the solution must maintain the same security as the solution. | Requires validation, perhaps as part of initial Service / Operator Acceptance, but is unlikely to need re-validating. |
| Periodic | Disaster Recovery: there must be a failback process defined to revert operations back to the primary site. | It is important to periodically test this process. |
| Before relevant changes | Transaction Performance: the <name> batch component must process <number> <items> within <time> hours. | Should be tested in the SDLC. |
| Continuous (makes good SLIs and SLOs!) | Transaction Performance: online performance of the new <txn> must complete within <time> seconds in <percent>% of cases, and within <time> seconds in 100% of cases. | Can be continuously tested. A perfect candidate for translating into two Service Level Indicators, each with its own Service Level Objective (one SLI for each completion-time and percentage combination); see the sketch after this table. |
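As a rough sketch of the continuous example in the last row above, the single latency NFR can be split into two SLIs, each judged against its own SLO. The thresholds and target percentage below are placeholders of mine standing in for the <time> and <percent> values in the requirement:

```python
# Sketch: one latency NFR translated into two SLIs, each with its own SLO.
# Thresholds and targets are placeholders for the <time>/<percent> values above.

def latency_slis(latencies_ms, typical_ms=2000, typical_target=0.95, hard_ms=10000):
    total = len(latencies_ms)
    # SLI 1: fraction of transactions completing within the "typical" threshold
    sli_typical = sum(1 for l in latencies_ms if l <= typical_ms) / total
    # SLI 2: fraction completing within the hard upper bound (SLO target is 100%)
    sli_hard = sum(1 for l in latencies_ms if l <= hard_ms) / total
    return {
        "typical": {"sli": sli_typical, "slo_met": sli_typical >= typical_target},
        "hard_limit": {"sli": sli_hard, "slo_met": sli_hard >= 1.0},
    }

print(latency_slis([120, 480, 1900, 2500, 9000]))
```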

 

Some Non-Functional Requirements can be validated in production without requiring any synthetic load to be applied to the system, for example:

Performance must not degrade below NFR Performance Targets over an <8 hour> window when the system is executing against typical daily load.

These are perfect candidates for translating into SLIs with SLOs.
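A minimal sketch of such a check, assuming latency samples are already being collected from real (not synthetic) production traffic; the window length, quantile, and target are placeholders for the values in the NFR above:

```python
# Sketch: evaluating a performance NFR over a rolling window of real traffic,
# with no synthetic load. Window, quantile, and target are illustrative placeholders.
from datetime import datetime, timedelta

def window_meets_target(samples, now, window=timedelta(hours=8),
                        target_ms=2000, quantile=0.95):
    """samples: list of (timestamp, latency_ms) tuples observed from live traffic."""
    recent = sorted(latency for ts, latency in samples if now - ts <= window)
    if not recent:
        return True  # no traffic in the window, so nothing to judge
    idx = min(len(recent) - 1, int(quantile * len(recent)))  # crude percentile
    return recent[idx] <= target_ms

# Usage with fabricated sample data
now = datetime.now()
samples = [(now - timedelta(minutes=m), 150 + m) for m in range(0, 480, 5)]
print(window_meets_target(samples, now))
```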

In addition, some NFRs that do require generating synthetic load in production may also be useful as SLIs, although the idea may sound strange at first. Rather than waiting until your production system is under real heavy load to see whether it copes, you could apply heavy load in production in a more controlled way, at times when the business consequences are less critical. This is essentially a variant of Chaos Engineering. Obviously it is essential that synthetic load doesn’t lead to physical or business side effects (e.g. you don’t want your synthetic load to actually order products in your e-commerce platform!). Essentially this relates to maximising the benefit of the Wisdom of Production.
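To illustrate the point about side effects, the sketch below explicitly marks synthetic requests so that downstream systems can treat them as a dry run and never, say, place a real order. The header name and endpoint are assumptions of mine, not part of any standard:

```python
# Sketch: sending clearly-marked synthetic load. The header and endpoint are
# illustrative; the receiving service must recognise the marker and skip any
# real business side effects (e.g. never actually ordering products).
import requests  # third-party HTTP client, used here purely for illustration

def send_synthetic_order(base_url: str):
    resp = requests.post(
        f"{base_url}/orders",
        json={"sku": "TEST-SKU", "quantity": 1},
        headers={"X-Synthetic-Traffic": "true"},  # server treats this as a dry run
        timeout=5,
    )
    return resp.status_code, resp.elapsed.total_seconds()
```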

To further illustrate my general point, the following table describes some deliberately bad examples of attempting to validate NFRs at inappropriate frequencies.

| Validation Frequency | Bad Example (don’t do this) | Explanation |
| --- | --- | --- |
| Only once | Capacity (concurrent execution): servers not to exceed an average of <75%> CPU utilisation under peak load over a <X minute> duration. | If this is happening in production we want to know about it. NB: it isn’t necessarily good as an SLI, though, because it doesn’t directly impact users. |
| Periodic | Security: component failure must not compromise the security features of the solution. | This doesn’t change unless the system architecture changes, so there is no point validating it at arbitrary times. |
| Before relevant changes | The batch system must not exceed the batch-processing window available when executed with production data volumes. | Testing this before go-live is not a bad thing, but it would be a missed opportunity not to test it continuously. |
| Continuous | Internal Compatibility: the new solution should be based on tried and tested technology. | This doesn’t change dynamically and cannot be automatically validated, so it is not going to change unnoticed, and it is expensive (human-effort intensive) to check. |
| Continuous | The online solution must support <NUMBER> concurrent users in a peak hour/day. | If we aren’t getting <NUMBER> concurrent users, we may not want to generate them artificially, because there is a cost associated with doing that. |

Please let me know if you have different opinions on this.  I’d love to learn from you.
