The problem of SRE vs DevOps vs ... whatever
One of the well-intentioned but less valuable conversations that tends to happen around the discipline of “Site Reliability Engineering” is “What’s the difference between a Site Reliability Engineer (SRE) and DevOps?”. There is an abundance of articles with helpful comparisons such as “DevOps use automation tools like Puppet or Chef to ensure consistency vs. SRE, which avoids these as they do not scale, instead using languages like Python or Bash” or “DevOps are the (people) writing code vs. SRE, which are more investigative”. I usually answer with the facetious “I’m an SRE, as that gets paid more”.
I don’t think any of these comparisons are helpful. Instead, I want to provide a different way of looking at the problem, one that makes us much more accountable for improving our organization’s capacity to deliver a better customer experience.
The things we need to think about are:
Mission
Problem Domains
Team Size & Contribution Models
Mission
As far as I can see, the mission of both “DevOps” and “SRE” still boils down to the same thing — to ensure the customer experience remains reliable and we retain customers’ trust, while enabling the business to further innovate on the customer experience.
The Problem
Ensuring the customer experience remains available and retaining customers’ trust requires various capabilities. For example, we need to be able to:
Determine when there’s a customer experience regression. Occasionally this goes under the banner of “Observability”, “Availability”, or “Bounce Rate” — it all boils down to “How many customers are online right now, and are they having a bad time?” (a minimal sketch of such a check follows this list).
Determine where the performance regression was introduced. This is Observability — the ability to reason about the (usually distributed) system’s internal state from its external outputs. In practice, this means instrumenting the system with logs, traces and metrics, and using whatever active probing tools you need to understand what production is doing.
Respond to that customer experience regression — especially in an emergency. This is usually called “Incident Response”, meaning that an engineer (often with deep system knowledge and specialized training) will jump in and repair it at any given time of day.
Run the software on some computers (generally a public cloud) and expose that software to the world. This is frequently termed “infrastructure”, but in practice means defining what storage, compute, memory and network requirements a given application has and designing a system that satisfies those requirements. Think “Kubernetes” or “Ansible” or “Saltstack” and so on. If you’re really big (or old), this can mean racking and stacking machinery.
Facilitate updates to the software running on those computers so that engineers can push changes without fearing that the update process will cause a customer-visible failure, and without an absurd amount of work. This is generally “Continuous Integration” and “Continuous Deployment”; if you’re lucky, with a healthy dose of “Feature Flagging” or “A/B Testing”.
Provide mechanisms to restore data that invariably becomes corrupted through accident or malicious intent. This includes things like “Point-in-time backups”, “Offsite Backups”, and the rarely invoked “Restoration Process”.
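To make the first capability a little more concrete, here is a minimal, hypothetical sketch of the kind of check that answers “are customers having a bad time right now”: compute a success-rate SLI over a window of request outcomes and flag a regression when it drops below a target. The window, the latency budget, the SLO target and the RequestOutcome shape are all assumptions for illustration, not a prescription for any particular tooling.

from dataclasses import dataclass

# Hypothetical request record; in practice this would come from your
# metrics or logging pipeline rather than an in-memory list.
@dataclass
class RequestOutcome:
    latency_ms: float
    status_code: int

def availability_sli(outcomes: list[RequestOutcome],
                     latency_budget_ms: float = 500.0) -> float:
    """Fraction of requests that were 'good': non-5xx and within the latency budget."""
    if not outcomes:
        return 1.0  # no traffic, nothing to complain about
    good = sum(1 for o in outcomes
               if o.status_code < 500 and o.latency_ms <= latency_budget_ms)
    return good / len(outcomes)

def customers_having_a_bad_time(outcomes: list[RequestOutcome],
                                slo_target: float = 0.999) -> bool:
    """True if the measured SLI over this window has fallen below the SLO target."""
    return availability_sli(outcomes) < slo_target

# Example: one slow request and one server error out of 1000 breaches a 99.9% target.
if __name__ == "__main__":
    window = [RequestOutcome(120, 200) for _ in range(998)]
    window += [RequestOutcome(900, 200), RequestOutcome(150, 503)]
    print(availability_sli(window))             # 0.998
    print(customers_having_a_bad_time(window))  # True

Everything past this point — where the signal comes from, how the window is chosen, who gets paged — is where the real design work lives, but the core question stays this small.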
While we need to provide these capabilities within the bounds of “Site Reliability Engineering”, there are many different ways to do so, each with its own tradeoffs and each useful only in the right context. However, there’s undoubtedly a pattern that holds (as far as I can see) across all organizations.
The general path to success
Within any given organization (at least, in my experience), the customer-facing software only constitutes a small fraction of the actual “work” needed to make that customer experience available. Instead, large chunks of time go into the work around the work: defining the application scaffolding, the compute definitions, the release process and the telemetry required for a reliable, trustworthy customer experience, none of which makes up the experience itself.
Additionally, that class of work tends to be repeated frequently across an organization with only minor adjustments in how it’s done (e.g. Kubernetes versus ECS, CircleCI versus Drone, Lightstep versus Honeycomb). Fundamentally, the problems facing each of these applications — the capabilities we need to expose — remain approximately the same. This work goes into the large bucket Amazon calls “undifferentiated heavy lifting”.
This means our capacity to contribute to the business is enormous within SRE. We can impact every product, project, piece of software or process in the org! There are several common patterns in the work we do:
Reduce the “Marginal Cost” of a capability
The capabilities the business needs to run come at a cost — usually, the engineering time spent rediscovering an independent solution to these problems. Our first and most crucial lever as SREs, then, is to identify a problem facing our product development communities and solve it so that it is as cheap as possible for those communities to integrate and use.
This requires that we identify the right problem. Identifying the right problem can be deeply challenging, as it requires us to leave our preconceptions of what constitutes a “good” design behind and instead work directly with the product community to understand their perspective. Once we deeply understand the problem, we can devise a solution that maximizes our future architectural choices while commodifying that capability for our product colleagues.
“Solving product engineering problems as cheaply as possible” sounds extraordinarily unsexy. Still, it’s essential to realize how powerful a change agent commodifying a capability can be. Reducing the marginal cost of text communication allowed a generation to develop a new language over “SMS”, something that still permeates our culture today. Reducing the marginal cost of bandwidth enabled Netflix, Google Meet and the home office. Reducing the marginal cost of finding a good restaurant on a Friday night means a much happier life partner and a better date night. Reducing marginal cost is fantastic!
After that, we need people to actually use our capability.
Standardizing the organization on an approach
While we can build (or buy) the capabilities we expose to the organization, the organization only derives value from these capabilities if they are used. To that end, we must work with our product colleagues to adopt our newly built capabilities.
This can be deeply challenging. What tends to happen here is that we spend a lot of time building a new capability without interacting with the organization. Then, when we go to the organization to advertise and integrate our capability, we discover it’s a poor product-market fit. It is critical to recognize this and either kill the product that exposes the capability or use the feedback to evolve it until it is a good fit. Simply wishing that the org or the engineers were different and “saw the wisdom” of our approach is a path to endless frustration.
Beyond that, standardizing the organization around a capability is the process of driving cultural change. There are many models for doing this, but my preferred one (Kotter’s) works by:
Creating a sense of urgency
Building a coalition
Forming a strategic vision
Enlisting a volunteer army
Enabling action by removing barriers
Generating short-term wins
Sustaining acceleration
Instituting change
The practical implementation of these change models is one for another article, but they’re a worthy investment of SRE time.
Identifying and building differentiating capabilities
Within any given organization, there is usually a series of (ever-growing) commodity capabilities such as “CI/CD” or “Cloud”, where there usually isn’t a significant differentiator between one vendor and another (or between a vendor and producing it in-house).
There is, however, a limited set of capabilities that aren’t available on the market, or that are uniquely leverageable for your organization if they’re constructed especially for it.
It’s difficult to say where a “differentiating capability” comes from. It’s usually some insight that’s either only possible in your organization or only apparent to people who’ve had a unique amount of organizational experience and are in a position to shepherd this capability. Half the time, it comes from a conversation after work over a beer. Examples I’ve seen in the past include Adaptive Paging or Meaningful Availability. These capabilities will meaningfully reduce the cost of something required or provide some compelling, valuable new insight that serves as a competitive advantage for the business.
Building a differentiating capability should generally be done in steps. The most successful versions I’ve seen go through:
Proof of Concept. Usually, one developer with a unique insight implements it in a weekend or hack week.
Rallying a Team. The team that developer belongs to pitches for and receives a (limited) amount of funding to build the new capability. Or they bury the work inside other projects.
Proof, then product development. The capability demonstrates value, and the organization happily endorses continued product work.
The challenging part of a differentiating capability is that from an outside perspective, there’s very little difference between “a differentiating capability that would provide a competitive advantage” and “a wild idea that has no merit”. After all, if everyone understood it, chances are it would have been built!
Given this, these capabilities take (and should take) a lot of work to deliver. They take time, grit and vision.
Team Size & Contribution Models
If we all share the mission of ensuring the customer experience remains reliable, the question becomes how we provide the required capabilities: customer experience measurement, observability, incident response, deployment, etc.
How we do this depends a bit on our organisational size and culture:
Solo Operator
In a startup (or very small) organization where there’s likely only 1 “SRE” to 5 - 10 engineers, the most valuable way to provide the capabilities is likely a combination of:
Judicious Vendor Selection: By looking at what is in the open market (especially in major cloud providers such as Google Cloud or Amazon Web Services), you’ll be able to expose capabilities for your colleagues that will enable them to solve their problems without relying on you.
Education: Most of your colleagues will not be domain specialists in any of the capabilities we need to solve for — instead, they’ll probably know much more about … whatever the startup specializes in. You’ll need to help them learn the minimum required to operate services effectively so they can take up a sustainable load and survive if you burn out or leave.
Review: Exposing the capabilities of major providers to untrained colleagues often allows them to rapidly prototype their work — at a cost. Usually, the cloud bill is discovered way too late in the project lifecycle. You should work to control these costs and ensure teams do not inadvertently attempt something that will be retrospectively expensive (a rough sketch of such a guard follows).
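As a rough illustration of that “Review” point, here is a hypothetical sketch of the simplest possible cost guard: read a billing export, aggregate spend per project, and flag anything over its budget before the bill becomes a surprise. The CSV layout, column names, budgets and file path are invented for illustration; real billing exports differ by provider and you would almost certainly wire this into whatever reporting your cloud already offers.

import csv
from collections import defaultdict

# Hypothetical monthly budgets per project, in dollars.
BUDGETS = {"checkout": 2000.0, "search": 1500.0, "data-pipeline": 5000.0}

def spend_by_project(billing_csv_path: str) -> dict[str, float]:
    """Aggregate cost per project from a billing export with 'project' and 'cost' columns."""
    totals: dict[str, float] = defaultdict(float)
    with open(billing_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["project"]] += float(row["cost"])
    return dict(totals)

def over_budget(totals: dict[str, float]) -> list[str]:
    """Return human-readable warnings for projects exceeding their budget."""
    warnings = []
    for project, spent in sorted(totals.items()):
        budget = BUDGETS.get(project)
        if budget is not None and spent > budget:
            warnings.append(f"{project}: ${spent:,.2f} spent against a ${budget:,.2f} budget")
    return warnings

if __name__ == "__main__":
    for warning in over_budget(spend_by_project("billing_export.csv")):
        print("OVER BUDGET:", warning)

Even something this crude, run weekly, tends to surface the expensive surprises while they are still cheap to undo.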
Small Team
In a small team (e.g. the standard “two pizza” team), it is possible to build a certain level of redundancy around any given team member. Being able to survive a team member leaving means that the SRE team can take responsibility away from the broader organization and run it solely within the SRE team.
This manifests itself in a few ways:
Prescribed Interventions: An SRE team can provide certain consulting services to the organization, designed to leverage the domain expertise of those SREs, work that would otherwise be too inefficient to do within product teams. These include things like “production readiness” checks, “postmortem reviews”, or “reliability task forces”.
Productizing a Capability: An SRE team can expose a new capability through judicious vendor selection and recommendations or by developing that capability as a proof of concept.
Centralizing Decision Making. Where a single SRE might need to align decisions across the organization with several stakeholders, a team of SREs can provide a “service layer” through prescribed interventions or by productizing a capability. This layer of abstraction means the SRE team can iterate internally, improving a capability without needing to align outside that group.
A single SRE team is limited in what it can solve, so it should be targeted at the most critical problems an organization has at any given time, letting the rest go. This often means that this is an extremely fun — if challenging — time to be an SRE, as you will invariably be exposed to everything, and it’s always high pressure.
Department
As an organization grows, it’ll become more and more cost-efficient to build some capabilities in a sufficiently bespoke or organizationally optimized way—for example, incident response tooling, observability tooling that meets compliance requirements, reporting or infrastructure processes.
At the department level (i.e. 30 people), it is no longer possible to scale an approach based on interpersonal relationships (i.e., the “over beers” management style). Instead, it becomes critically important to articulate a vision of what the SRE department is trying to accomplish and to delegate parts of the execution of that vision to independent teams tasked with creating and exposing capabilities. Notably, at the department scale, team members can become more disconnected from their product delivery colleagues and hit the earlier struggle of “standardizing an approach”.
The department level also allows the aggregation of a layer of technical expertise (“staff” or “principal” engineering), which can form the “consultation” arm of such a department. These function as a feedback layer between the teams and the rest of the organization, ensuring we do not drift off course.
Organization
As the organization grows even further, the need to build progressively more capabilities grows in parallel. To scale the approach, we need to further delegate the vision into strategic buckets that each department can own. This requires a broader, less prescriptive vision and a limited set of strategic choices that guide departments in designing their approach.
The scale of the organization also means it becomes possible to commodify processes internal to the organization’s successful functioning — namely, the discovery and articulation of product requirements.
Given this, a well-functioning product process (high-level KPIs, a breakdown of those KPIs into a KPI tree, a process for prioritizing work and a way of managing a portfolio of projects) is a requirement for successfully commodifying the capabilities a large organization needs.
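To illustrate what “a breakdown into the KPI tree” might look like in practice, here is a small, hypothetical sketch: a top-level KPI is decomposed into child KPIs that individual departments or teams can own, and the tree can be walked to report on the leaves people actually work against. The KPI names and targets are invented purely for illustration.

from dataclasses import dataclass, field

@dataclass
class KPI:
    name: str
    target: str
    children: list["KPI"] = field(default_factory=list)

# Hypothetical tree: one organization-level KPI broken down into
# department-ownable children.
customer_trust = KPI("Customer trust (retention)", "churn < 2%/quarter", [
    KPI("Checkout availability", "99.95% of requests good", [
        KPI("Deploy-caused incidents", "< 1 per quarter"),
        KPI("Time to detect regressions", "< 5 minutes"),
    ]),
    KPI("Support burden", "< 500 tickets/month"),
])

def leaves(kpi: KPI) -> list[KPI]:
    """The leaf KPIs are the ones individual teams actually work against."""
    if not kpi.children:
        return [kpi]
    return [leaf for child in kpi.children for leaf in leaves(child)]

for leaf in leaves(customer_trust):
    print(f"{leaf.name}: target {leaf.target}")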
The Name
The question at the end of all this is: what do we call the people tasked with ensuring we remain reliable and retain customers’ trust?
Well, call them whatever you like. In my experience, Site Reliability Engineering is a specialist domain within Software Engineering, like Distributed Systems engineering, Architecture, Developer Productivity, React, Spring and many others. Occasionally, hiring people who specialise in these areas is more challenging, and it’s easier to convince the organization to pay more if we give our prospective colleagues a unique name.
Other than that, call us software developers.
Additional Reading
https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started