Simple, Beautiful Software Development

The value of a software engineer

Andrew howden — Tue, 12 May 2026 09:33:12 GMT

Photo by Mimi Thian on Unsplash (Edited)

As a software developer I take a great deal of time to try and further develop my skills and understanding of computers, software, the architecture thereof, running software in production and measuring its success.

It is thus a little perturbing that in my more recent reflection I’ve come to consider that these skills, in and of themselves, are not worth very much. That’s somewhat frustrating, as such skills can be heavily rewarded within the monoculture of our software development peers, and are often inroads to furthering our prestige and careers. However, though we are tempted, we cannot forget the fundamental nature of what we do:

We serve users.

Those members of our broader communities: the parents, children, white and blue collar workers, the richer and poorer members, those who are local to the area and those who have immigrated. Those complex entities known as “people”.

Given this, I think it’s worth re-evaluating how we approach our own self-assessment of our equity, with a view to improving some of the skills that are perhaps less emphasized within our profession:

Heads Up: This post is old, originally written in 2019. I was a senior software engineer at the time. That said, it’s now something I’m sending to a few people in the age of Agents, and I wanted to have a better place for it. So, it’s here!

Understanding each other

All computer problems are, fundamentally, human problems. Perhaps uniquely so: as software developers, we sit atop an entire stack of our forebears’ ideas about not only software, but also life, the nature of relationships between teams, the relationship to the user and how we should talk to each other.

We need to take the time to understand not only the code that is written, but also the people who wrote that code and their ideas; the people who are attempting to teach us their own systems so that we can encode them into our software; those who will use the software; and those who will take what we have built and explain it to potential new users so that they can also benefit from it, hopefully parting with some cash to show their appreciation.

The skill that I would prize above all others is that which embraces the complexity of our lives and the lives of others, and seeks to find a harmony between us such that we can build a thing that is truly a useful contribution to our community, rather than what amounts to a technical work of art.

Understanding the market

Market (noun):
a regular gathering of people for the purchase and sale of provisions, livestock, and other commodities.

Our markets are a complex entity involving the calculation of “value” given the social, cultural and financial context in which a given user is operating.

Understanding some of these complexities can help us as software developers create products that deliver the appropriate value to the users that we’re building for. If we’re able to examine what’s happening in the existing market, and spot ways that we’re perhaps able to help users in our uniquely efficient way, we’re able to create a product that has the chance to be much more successful.

Further, if we understand the market we’re able to position our product in such a way that users are able to reason about what it can do for them and in their lives.

This gives what we bring to life through our actual development a much greater chance of commercial success, which in turn gives us a better chance to further improve the service or develop new ways in which we can improve the lives of our users.

Mapping domain specific expertise into software

The value of software is not in its inherent nature to do a thing repeatedly, but that this thing does something of value to us. We use software for a purpose; to derive a value from it.

Human endeavors are invariably complex endeavors, especially now that the encoding of simpler processes, such as sending a letter or writing up some literature for expression into the world, has been well solved by services such as Gmail or Medium. The problems that remain are far more complex, though the payoff for solving them is still enormous.

Unfortunately, though I like to try, we as developers cannot seek to understand the entire world. While we spend our time furthering our own engineering capacities, others spend that same time furthering their knowledge of their own discipline: accounting, medicine, coaching, construction, marketing, logistics or any of the other complex processes that make up the social fabric of the world.

However, the success or failure of the software we build depends on how well it delivers value that complements or replaces those domain experts; that is, on how well we’re able to map their understanding into software.

We thus need to spend some of our time learning our area of software development, and some learning how the experts whose knowledge we’re trying to encode into software approach a given set of problems. We further need to reason about how we can represent this approach in software, and talk further to those experts so that we can explain the limitations of machines and how to overcome them.

The skills required to ship code

It is here that I would place the skills that I have previously spent so much time refining; the engineering aspect of our discipline.

As software developers we need to be able to take the abstract representations of things that we understand and build both their logical representations and user interfaces such that users can either learn or map their own representations of that same abstract entity onto the interface.

To do that requires extraordinary discipline and coordination with other developers, stakeholders and users. One must make changes with a consistency I have never seen in other disciplines, matching, adjusting or working with the logical representations of our forebears’ mental models.

It is perhaps most important to emphasize here that a developer’s work is only “done” when it is shipped and in front of users. Between conception and delivery (indeed, until after the software has been measured in production and verified to be working as intended) software changes are almost entirely hypothetical.

The skills required to run code reliably

Lastly, while the conceptual model that we simply build and ship our software is a simple one, it is not quite how software pans out once it gets in front of users.

Users are wildly unpredictable, sometimes actively hostile entities. When developing software (particularly web software) one must both build and launch the software, and then actively manage it as users start using it, probing it and generally introducing it into their lives. These users combine en masse to produce interactions that we never imagined, and we need to ensure that our software still works as quickly and reliably as we predicted that it would, even though the requirements are not quite what we imagined.

We further must be aware that not all of our users will be friendly, and there will be users who will actively seek to find logical faults with our software and exploit them for their own gain.

In Conclusion

The building of software is, unsurprisingly, the vast majority of a software developer’s requirements. However, the building of software does not happen in a vacuum, and the complexities of life, of people and their cultures, and of the problems we’re trying to solve all go into determining whether our built product will be successful. We as engineers need to reach beyond the borders of simply committing code to try and understand the context in which we operate, and hold ourselves accountable for ensuring that our users are actually deriving the value from our software that we imagine they should.

Learning Path: Software Engineering

Andrew howden — Sun, 28 Dec 2025 07:55:36 GMT

I find myself coming back to a series of posts, books, talks and so on; this is my attempt to document them. The items are (loosely) hierarchical, but I’d pick and choose among them.

People

Papers / Articles

Books

Refactoring - Martin Fowler
Patterns of Enterprise Application Architecture - Martin Fowler

(The following are related to, but not directly about, software engineering)

The Principles of Product Development Flow - Donald G. Reinertsen

Related Paths

LLM Prompt: Writing Styleguide

Andrew howden — Tue, 14 Oct 2025 08:23:45 GMT

Education & Reading Level

The paper should be targeted at someone who has a university education (e.g. a bachelor’s or master’s degree) in the subject matter being explored, and should not shy away from the use of jargon.

Styleguide

The writing should be simple, taking inspiration from books like “On Writing Well” by William Zinsser, or similar writing styleguides. As a rule, it should avoid using the “passive voice”.

Format

Prefer to make the content as a series of paragraphs, using dot points sparingly and only where the subject is useful to understand through a list.

Fidelity

Precision

Ensure that the paper is focused, and makes a limited number of key points rather than introducing additional subject matter. Where a paper has any meaningful length (e.g. more than one page), create an executive summary at the beginning of the paper that summarizes it.

Ensure that the paper is precise. It should not use terminology to convey size without providing numbers; where no numeric representation exists, it should use simple relative terms (e.g. “large” and not “huge”).

Writing a bug report

Andrew howden — Fri, 14 Feb 2025 09:52:10 GMT

As a software engineer, my stock in trade is bugs and features. My team and I continually need to change the software to make it perform some new thing, or to modify the behaviour of a thing it already does, cycling through releases approximately once per week.

We don’t always get it right.

We take reasonable care in how we develop software; all changes are subject to code review, nearly all system changes are expressed in version control and all changes need to go through an independent QA process — all of which means we do okay. But sometimes, something slips through the cracks.

Invariably, the things that slip through aren’t the things that we thought of. They’re usually the result of complex system interactions, a limitation in the way we’ve constructed the business logic or some bespoke circumstance that modifies the state such that it breaks a fundamental assumption. Because they’re things that we both don’t think of and don’t test for, we must get as much information as possible to help us investigate the issue further.

Skipping ahead, when writing a bug report it’s best to follow the formula supplied below, in the “Markdown” format:

# The PayPal checkout seems to fail with the error "Error: The upstream provider returned an invalid response."

## Story

On Thursday, 25th of February 2018 at approximately 6:18pm I attempted to checkout via the standard checkout with the following items in my cart:

- 2x Widget
- 1x Foobar

I delivered to the address:

Untermainkai 30
Frankfurt am Main 60320
Deutschland

And attempted to pay via the payment method Mastercard, with the last 4 digits ending in 0430.

I was able to complete the checkout up to and including the PayPal step, but on being sent back to the store the store showed only "Error: the upstream provider returned an invalid response".

I did not receive any order success email as I expected, nor was I shown the "Checkout Success" page.

## Screenshots

Please find attached the following screenshots which show:

1. The error following the checkout
2. The contents of my cart prior to checkout
3. The order summary page prior to being redirected to PayPal

## Technical Detail
### Browser

The browser I used was Chrome. The details from the `chrome://version` tab are as follows:

```
Google Chrome 66.0.3359.181 (Official Build) (64-bit)
Revision a10b9cedb40738cb152f8148ddab4891df876959-refs/branch-heads/3359@{#828}
```

### Browser Console

The browser console showed the following entries:

```
content.js:4 [Deprecation] chrome.loadTimes() is deprecated, instead use standardized API: nextHopProtocol in Navigation Timing 2. https://www.chromestatus.com/features/5637885046816768.
(anonymous) @ content.js:4
content.js:5 [Deprecation] chrome.loadTimes() is deprecated, instead use standardized API: nextHopProtocol in Navigation Timing 2. https://www.chromestatus.com/features/5637885046816768.
(anonymous) @ content.js:5
```

The above report is super detailed — I would be extremely happy to receive such a report. But it’s worth unpacking the report further, to understand why each section is important from the point of view of the developer.

Title

The title is surprisingly important. There are always many outstanding bugs that must be triaged, and it’s often not clear which task has the highest priority. However, we can determine from the title:

The PayPal checkout seems to fail with the error "Error: The upstream provider returned an invalid response."

That the error is in the checkout, the last step of the user journey. It’s thus critically important, and is likely to trigger incident response internally. Additionally, we can see that it’s associated with a specific provider, and from the error we can see that the provider is performing abnormally. It may be that the provider is experiencing issues, which can usually be quickly verified with status pages.

Story (or steps to reproduce)

Perhaps the most important part of a bug report is detailed steps to reproduce. It’s best to write these in as much detail as possible, and from the point of view of the person who has experienced the issue. Let’s break it down piece by piece.

Time and Date

On Thursday, 25th of February 2018 at approximately 6:18pm I attempted

As developers, we know ahead of time that we are not going to catch all issues. As far as I’m aware, there has never been bug free software, no matter the expense — the systems that we build upon are simply not that reliable. So, we build the application with this in mind. We collect certain diagnostic information at all times throughout the application, but we collect even more information when we have determined that an unusual situation has occurred.

We are able to look up the information associated with a request, but only if we know specifically which request it is. Sometimes it’s not so obvious which requests are problematic and which are not; the proverbial needle in the haystack. So, narrowing the problem down to a time range allows us to look very specifically at the information from a certain time, which can dramatically reduce the amount of effort required to investigate.
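To make that concrete, here is a minimal sketch of narrowing a set of request records down to a window around the reported time. The record structure, field names and the 15 minute margin are all assumptions made for illustration; a real logging system will have its own query language and schema.

```python
from datetime import datetime, timedelta


def requests_near(records, reported_at, margin_minutes=15):
    """Return the records whose timestamp falls within the margin around the reported time."""
    margin = timedelta(minutes=margin_minutes)
    start, end = reported_at - margin, reported_at + margin
    return [record for record in records if start <= record["timestamp"] <= end]


# Hypothetical request log entries; the ids and fields are illustrative only.
records = [
    {"id": "req-1", "timestamp": datetime(2018, 2, 25, 9, 2), "status": 200},
    {"id": "req-2", "timestamp": datetime(2018, 2, 25, 18, 11), "status": 200},
    {"id": "req-3", "timestamp": datetime(2018, 2, 25, 18, 19), "status": 502},
    {"id": "req-4", "timestamp": datetime(2018, 2, 25, 23, 40), "status": 200},
]

# The report says "approximately 6:18pm", so we search either side of that time.
reported_at = datetime(2018, 2, 25, 18, 18)
candidates = requests_near(records, reported_at)
print([record["id"] for record in candidates])
```

Even an approximate time turns a whole day of requests into a handful of candidates, which is why the margin is applied on both sides of the reported time rather than requiring an exact match.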

Detailed User Story

I attempted to checkout via the standard checkout with the following items in my...

...${SEE_ABOVE}...

...but on being sent back to the store the store showed only "Error: the upstream provider returned an invalid response".

As mentioned, bugs often arise out of complex behaviour. Unfortunately we cannot test all possible combinations of what will occur in a production environment during testing. There are a couple of reasons for this:

We don’t know them, and
It’s cost prohibitive to test every combination of every feature prior to shipping to production

We make tradeoffs on the cost of an issue in a given area, and shape our testing policy accordingly.

Detailed user journeys that clearly describe the application state as well as the actual error allow us to reproduce exactly the set of conditions that gave rise to the error. In the case above, it may be that 1x Foobar contains a special character in its description that causes the application to break when it is returned from PayPal, or that the credit card has been marked as fraudulent and the application hasn’t been designed to deal with the issue.

If you take nothing away, please painstakingly detail the steps to reproduce the issue.

Screenshots

1. The error following the checkout
2. The contents of my cart prior to checkout
3. The order summary page prior to being redirected to PayPal

The phrase “a picture is worth a thousand words” is rarely more true than with screenshots associated with an issue.

As much as we try and create a common understanding of what is happening in an application and surface information that is useful for debugging, there is certain information that is inherently complex to communicate. A screenshot is a low effort tool to communicate an extremely large amount of information:

Time of day
Operating system
Browser
Application state through status icons
User data that can be correlated with the issue

While it would be possible to ask for all of these details up front, it’s often difficult to explain the impact of different browsers, or why an operating system matters. Sending screenshots allows a vast amount of technical communication with an ostensibly simple action.

Further, multiple screenshots compound this effect. It is possible to determine state change over time from screenshots, and there may be things the developer can spot that are difficult to communicate.

Lastly, screenshots even make sense in terms of “terminal” or text only applications!

Technical Detail (Bonus)

There are some bugs that, no matter how much we instrument, are impossible to foresee and thus instrument for. If a bug is being reported by a more technical user (for example, a project manager) it’s possible for them to communicate additional, very technical information about the nature of the bug, which helps particularly with bugs that are otherwise extremely hard to trace.

Browser Version

The browser I used was Chrome. The details from the `chrome://version` tab are as follows:

It is possible for a developer to write code that works perfectly well one day, but breaks the next. One such example would be the use of in-development web specifications whose format changes after the development implementation, or which are deprecated in favour of other specifications. Lastly, there is a class of bugs in which the browser fails to implement the specification as documented, in which case the browser itself is problematic rather than the application.

By being specific about the browser version we can either rule out a class of issues, or otherwise reproduce an issue exactly as it’s defined. Additionally, copying directly from the “about” or “version” sections of a browser allows us to capture information that may seem redundant, such as the (Official Build) part of the Chrome version string.
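As a rough illustration of why the exact string is useful, the sketch below pulls the version numbers and the bracketed qualifiers out of a pasted Chrome version line so they can be compared against a known range. The function name, the regular expressions and the `FIRST_FIXED_MAJOR` threshold are assumptions for this example, not part of any browser API.

```python
import re

# Four dot-separated numbers, as they appear in the pasted version line.
VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def parse_chrome_version(text):
    """Return ((major, minor, build, patch), [qualifiers]) or None if no version is found."""
    match = VERSION_PATTERN.search(text)
    if match is None:
        return None
    version = tuple(int(part) for part in match.groups())
    # Everything in parentheses after the version, e.g. "Official Build".
    qualifiers = re.findall(r"\(([^)]*)\)", text[match.end():])
    return version, qualifiers


reported = "Google Chrome 66.0.3359.181 (Official Build) (64-bit)"
version, qualifiers = parse_chrome_version(reported)

# Hypothetical threshold: suppose a fix is only known to work from major version 67.
FIRST_FIXED_MAJOR = 67
print(version, qualifiers)
print("affected" if version[0] < FIRST_FIXED_MAJOR else "not affected")
```

Comparing the version as a tuple of integers rather than as text avoids mistakes such as treating version 100 as older than version 66.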

Browser Console

content.js:4 [Deprecation] chrome.loadTimes() is deprecated, instead use standardized API: nextHopProtocol in Navigation Timing 2. https://www.chromestatus.com/features/5637885046816768.
(anonymous) @ content.js:4

The browser console is a tool that is used by various browsers to express additional, non user facing information to interested parties, specifically information that is useful for developers. There are various mechanisms to access it, but in Chrome it’s Ctrl+Shift+J.

There is a particular class of problems with “JavaScript” code into which we often have very little visibility. These issues are expressed through the JavaScript console, so adding the console output to the report can render an otherwise completely opaque problem transparent.

HAR File

For the truly technically savvy among us, it’s even possible to capture the entire request and send it through in a format called “HAR”. Instructions on how to do this in Chrome are at the following address:

https://developers.google.com/web/tools/chrome-devtools/network-performance/reference#save-as-har
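A HAR file is a JSON document, so once a developer receives one it is straightforward to inspect. The following is a minimal sketch that lists the requests that came back with an error status. The sample data, including the `shop.example.com` URLs, is invented for illustration, and treating any status of 400 or above as a failure is an assumption of this example.

```python
import json

# A tiny, hand-written HAR document standing in for a real export.
SAMPLE_HAR = """
{
  "log": {
    "version": "1.2",
    "entries": [
      {
        "startedDateTime": "2018-02-25T18:18:02.000Z",
        "request": {"method": "GET", "url": "https://shop.example.com/cart"},
        "response": {"status": 200}
      },
      {
        "startedDateTime": "2018-02-25T18:18:41.000Z",
        "request": {"method": "POST", "url": "https://shop.example.com/checkout/return"},
        "response": {"status": 502}
      }
    ]
  }
}
"""


def failed_requests(har):
    """Return (started, method, url, status) for each entry with a status of 400 or above."""
    failures = []
    for entry in har["log"]["entries"]:
        status = entry["response"]["status"]
        if status >= 400:
            request = entry["request"]
            failures.append(
                (entry["startedDateTime"], request["method"], request["url"], status)
            )
    return failures


failures = failed_requests(json.loads(SAMPLE_HAR))
for started, method, url, status in failures:
    print(f"{started} {method} {url} -> {status}")
```

With a real export, the inline sample would be replaced by reading the saved file from disk. Note that HAR files can contain cookies and other sensitive request data, so they should be shared with care.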

Conclusion

Bugs are an inevitable part of software development. They are also among the most painful issues to investigate, and thus among the most expensive that we can work on. By submitting a detailed bug report we can save a large amount of time and back-and-forth discussion, and get fixes into production faster.

Thanks

Aario Shahbany and Svetlin Kalendzhiev who assisted in creating the “ideal bug report”.
Tomasz Kaplonski for reviewing and suggesting improvements.

On communicating through writing

Andrew howden — Mon, 06 Jan 2025 17:43:14 GMT

In February 2024, I joined a new organization as a “principal engineer”. This was a bit of an adventure: the last 12 months have been an opportunity for me to learn an enormous amount, as well as to join a high profile project, sort through a number of substantial project challenges and establish relationships at different levels throughout this new company.

Whenever joining a new company, I’m always surprised at how differently that company works. In this particular case, I rejoined a series of colleagues I’d worked with in the past, but in vastly different roles and with a very different customer base. It was a good opportunity to see both the culture that exists between my former colleagues and me, and the culture that exists in the broader group.

One of the principles that I take for granted is the use of documents as a vehicle for coordinating decisions. I am used to product, portfolio, technical, organizational and managerial decisions all having an accompanying (hopefully short) paper. This exists in pockets in the new company I work for, but it is by no means “common”. So! This article exists to try and explain why I am so used to using documents to drive a shared understanding, and ultimately, a clear decision.

The Fundamental Challenge: Developing an Idea

The goal of most communications is first to develop a shared understanding of an idea, whether that’s “We should use Go as our primary software development language”, “This application programming interface needs an additional field which determines whether or not this application is a frontend application” or “We need a new team to build a load testing capability”. The second is to allow others to contribute to that idea, bringing their own knowledge and perspective to improve the assumptions that the idea makes, or to bring a new iteration of the idea that solves the underlying problem. Lastly, there should be enough clarity that a leader can make a clear decision that is well understood in the team.

There are different ways to “develop an idea”. The most frequent ones I come across are:

A “workshop”, in which team members share their understanding in a “brainstorm” against a common model (e.g. on a whiteboard).
A “document”, in which someone takes the time to write up their understanding

My experience so far is that a “workshop” is incredibly useful to gain a lot of perspectives quickly and settle on the “broad definition” of an idea. However, workshops usually do not provide enough clarity that a decision can be made “in the moment” — while participants may feel they share a good understanding of each other's perspective, frequently I find that when it is written into a document and re-read, what’s there is quite different than the expectations that each participant had out of the workshop.

Additionally, workshops miss at least four key opportunities for input. These are:

Inspiration. The proverbial “shower thought” or key moment of insight that happens outside a workshop, when your brain is just idling over the day.
Research. A structured analysis of what other people have done, and been kind enough to communicate in the past.
Feedback. Asking a broader group of participants outside the initial workshop for their input.
Experimentation. Having a go in a cheap way, before settling on a larger investment.

All of these can substantially improve the initially described, rough idea.

💼 Practical Experience: Earlier this year, I joined a very large, company wide project. This project used workshops early on in its delivery to make key decisions, but only some of the decisions were extensively written up. Later, it turned out many of the initial assumptions didn’t pan out, and the project ran into substantial trouble. The project did not have clearly defined checkpoints or moments to course correct. It was put back on track by making clear, key decisions and aligning with senior stakeholders on the new requirements — the majority in time, in writing, and with substantial earlier preparation.

Communicating an idea

Let’s assume that an idea is clearly defined, if not articulated, in someone's mind. There are different ways to convey that idea, each of which can solve a different problem. These include:

Verbally

Verbally communicating an idea is extremely cheap, requiring almost no preparation and providing the opportunity for the communication partner to take a directed path toward understanding by asking questions.

Unfortunately, verbal communication scales very poorly. While it is possible to communicate verbally to a large number of people, this substantially increases preparation cost and decreases the opportunity to ask questions with larger audiences. Additionally, verbal communication happens at a point in time, which means if someone is not present at that point, they miss it.

Lastly, while it can be very fun watching a compelling speaker speak passionately about their idea, that passion can bias us toward endorsing a fundamentally weak idea from a fundamentally strong speaker.

With Slides

It is possible to overcome some of the challenges of speaking verbally by providing visual aids. This enriches the communication with additional context that can help reinforce the speaker's meaning, and provides some ability to review the content later.

This is very useful to communicate pre-aligned ideas, but is less useful for developing them further. It also suffers from the “point in time” problem.

💼 Practical Experience: In October 2022, a colleague (Salome Santos) and I spoke on the topic of how our reliability culture and practice evolved over time. While the talk seemed to be well received, it was the product of 100 hours of work getting the talk written up as a script, making slides and practicing multiple times before delivery. Writing was at the core of the slide delivery.

With Video

It is possible to go further and create engaging material by recording a talk, and then using video editing tools to provide more complex visual aids for the speaker. Speaker awkwardness, such as long pauses or other linguistic challenges, can be removed.

It additionally addresses the “point in time” problem that slides suffer from, but is still not very useful for developing an idea.

💼 Practical Experience: In 2023, I created a course called “practical introduction to observability”. Similarly to the slides, it used writing as the core mechanism to align content. However, to turn writing into video was about 80x the initial writing time. This is a key reason why the content has not evolved quickly, and why I have not produced additional material.

In Writing

Easily my favourite way of communicating. Communicating in writing is substantially more expensive for the author, needing both to articulate their understanding and then to review it before it is published to others.

However, it is far cheaper for the readers. People can frequently read much faster than they can listen to a speaker or watch a video. While text doesn’t inherently provide visual aids, it is possible to enrich it with diagrams (or even videos) to provide the required understanding.

Writing also does not suffer the “point in time” problem, as writing can be read at any point in time — whether immediately after it has been written, or months later as a given idea is being reviewed for its efficacy. It can help us “shift back” in time to understanding our thinking at that point, rather than us trying to imagine it with the biases that our more recent experiences have created.

Lastly, writing is extremely cheap to iterate on. A typical flow for a document might be:

Write the first draft, put it away. Read it in the morning. Edit it some more.
Send it to a colleague for their review. Edit it some more.
Send it to an immediate manager for review, in conjunction with experts from a small community. Edit it some more.
Send it to a broader community for review. Edit, and “publish” it — indicating there will be no more edits, and the decision is made.

This is something that no other communication medium can deliver on.

💼 Practical Experience: This article has been sitting in my inbox for about 4 weeks before I finally took the time to write it up. In the meantime, I added notes on what I wanted to say as I thought of it during my commute, or in discussions with those around me.

Point in time reading

While it is possible for people to read documents at any time, it is frequently the case that people do not prioritize reading material without dedicating time to it specifically. To overcome this challenge, we can employ a “silent meeting”.

A silent meeting is a meeting in which all participants are given the document, and a reasonable amount of reading time (usually 50% of the meeting time). They can then read through the document, raising feedback through that meeting.

At the end of this silent reading period, a moderator will go through the comments in the document, selecting those that seem especially useful for further discussion. The group will discuss each selected comment, after which a note will be made and the author will (later) refine the document based on the group's feedback.

This works extremely well for technical design documents, product discussions or other substantial engineering decisions. Learn more at “running an effective meeting”.

💼 Practical Experience: As part of a project this year, I needed to align with multiple senior stakeholders on how to collaborate and determine a path forward. To do that, I wrote up the current challenges, contributing factors and recommendations and held a “silent meeting”. While there was discussion at the end of the meeting, this conversation anchored all future collaboration, and provided both parties with a clear path forward.

Writing Tips

Communicating via writing is a skill that, like all others, takes practice. This blog is my attempt both at communicating ideas, and improving my ability to do so! However, there are some practical things you can do to improve your writing quickly:

Write toward a target audience. Have a specific group in mind when writing, and review your writing against whether those people will understand it.
Keep your documents concise. While it is useful to “think” in long form, it is awkward to read. The less fluff in a document, the better. As a rule, a document should never exceed 6 pages, and a decision never more than 1.
Remove complexity. When writing, it is possible to repeat yourself multiple times, or to use language that is bespoke (or “unusual”). Try and remove as much of this as possible, leaving only the simplified essence behind.

Bonus: Large Language Models

This past year (2024), large language models such as ChatGPT or Gemini have become substantially more prominent. These models excel at digging through an abundance of documentation for fairly vaguely articulated ideas. Ideas written up can be surfaced by these tools, giving readers that “directed path” through the material.

Additionally, large language models can be very useful as editors. This document was edited based on feedback from prompts such as:

Consider this document as a reader who has a professional background, but is earlier in their career. Would you find it interesting? Why or why not? What would make it more interesting?
Consider this document as a reader. What information in it is redundant or repeated? What can be safely removed without hurting the "core message"?
Consider this document as a graduate level reader. What language or terminology is unclear? How could that language be simplified?

That’s it. Happy Monday fam!

Anatomy of a “Good” commit message

Andrew howden — Fri, 01 Nov 2024 14:31:27 GMT

💡 This old post was originally posted on Medium on April 14th, 2018. However, it’s so useful that I moved it to the place where I keep my new posts! Enjoy, with the caveat I’ve grown some since I wrote it.

Git is a tool that’s fundamental to my software development workflow. In the five years I have been a developer, I have swapped out almost all my tools, but I have found nothing superior to git. Its adoption, tooling, speed, and reliability have made it a supremely difficult competitor to beat in terms of version control.

One of the more useful features of git is the “Commit Message”. As each change is applied to the software repository, it is annotated with a message. Users can put whatever they like in this message; however, some practices make the history of the repository much easier to review.

Let’s evaluate this by taking a look at a (fake) commit message:

Introduce the widget to handle image creation

The project owner has requested the ability to attach images to user
profiles. Currently in this project, while users can add images to their
own profiles, administrators cannot add an image on behalf of the user.
Administrators can create users and fill out certain user details, such
as name and email. However, with the introduction of a policy to ensure
all users of the system have an up to date photograph attached to the
profile, the administrator should be able to attach this photograph at
the time of the user's creation.

This commit introduces the widget in the admin section that handles this
capability. It reuses the existing objects that exist to represent user
images, only providing a new pathway by which they might be uploaded.

== Stakeholder Impact ==
=== Project Owner ===

This will allow the project owner to add new images to user profiles.
In this case the primary users of the application are employees, and
this will allow an easier publication of the welcome packet sent to the
team about new employees to ensure the employee transition is easier.

Additionally, the administrator will have a picture associated with all
users; useful for company directories or the like.

=== Users ===

By having a picture uploaded on their behalf before they start using the
system users are not required to immediately learn an unfamiliar system
to upload their own image prior to the dispatch of the welcome packet.
This allows the welcome packet to be sent earlier, and a smoother
transition into employment.

== Design Notes ==
=== Sanitisation by image copy ===

In order to prevent any inadvertent vulnerabilities (such as the
embedding of PHP code in EXIF metadata) the image is not stored in its
original form, but rather put through a converter to drop EXIF or other
steganographically stored data.

BREAKING CHANGE: Modification of the interface for the User object
                 constructor

Whoa. That was large! Let’s break it down piece by piece.

The subject line

The subject line is the first line in the commit. It’s used to show a small title of the commit in summary views, such as:

$ git log --pretty=oneline # amended

cd6d940 (HEAD -> master) AD-HOC refactor (deployment): Deploy automatically on master
810f662 AD-HOC fix (Prometheus Configuration): Update port used for Prometheus connections
6d2c979 AD-HOC feat (Prometheus Config): Add confluence server

You can see above a couple of bad commits, and a good one. In the case of our commit:

Introduce the widget to handle image creation

The goal for the subject line is to provide a concise summary of what the commit is about when reviewing commits en masse. Given this, some guidelines are:

Make it short: I usually aim for no more than ~72 characters.
Be descriptive: Be specific about what changed. “Fix bug” is not super great, but “Modify the import category object to nest subcategory arrays” is.
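Both guidelines are simple enough to check mechanically. What follows is a minimal Python sketch, not part of git or any existing tool: the function name `check_subject`, the hard limit of 72 characters (taken from the guideline above) and the list of vague subjects are all assumptions made for illustration.

```python
MAX_SUBJECT_LENGTH = 72  # the limit suggested above; adjust to taste

# Hypothetical examples of subjects that say nothing about the change.
VAGUE_SUBJECTS = {"fix bug", "fix", "update", "changes", "wip"}


def check_subject(message: str) -> list[str]:
    """Return a list of problems found in the subject line of a commit message."""
    lines = message.splitlines()
    # The subject is the first line of the message.
    subject = lines[0].strip() if lines else ""
    problems = []
    if not subject:
        problems.append("subject line is empty")
    if len(subject) > MAX_SUBJECT_LENGTH:
        problems.append(
            f"subject is {len(subject)} characters, limit is {MAX_SUBJECT_LENGTH}"
        )
    if subject.lower().rstrip(".") in VAGUE_SUBJECTS:
        problems.append("subject does not describe what changed")
    return problems
```

A check like this can only catch the obvious cases. Whether a subject is actually descriptive still needs a human reader.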

You’ll see the Angular commit guidelines in various places I code. I don’t feel strongly about these; they’re just part of the spec at Sitewards.

The Commit Body

i.e. “the rest of the commit”

The commit body is where we can provide context about the commit itself. I usually break it down into several sections:

General Background

The most important aspect of a git commit message is to provide the context around a code change. In our fake commit the example is below:

The project owner has requested the ability to attach images to user
profiles. Currently in this project, while users can add images to their
own profiles, administrators cannot add an image on behalf of the user.
Administrators can create users and fill out certain user details, such
as name and email. However, with the introduction of a policy to ensure
all users of the system have an up to date photograph attached to the
profile, the administrator should be able to attach this photograph at
the time of the user's creation.

This commit introduces the widget in the admin section that handles this
capability. It reuses the existing objects that exist to represent user
images, only providing a new pathway by which they might be uploaded.

As you can see, it’s lengthy. However, it’s our only opportunity to give the people who will be maintaining the code in future the necessary context behind the changes that we made.

Some guidelines for this one are:

Break at 72 characters: It is much easier to view in primitive tools such as the CLI, is the format expected on mailing lists and is well supported by tooling. While more modern tooling is less restrictive, it’s a nice nod to our computing past.
Write in the imperative: A git commit is a change (or “patch”) to code. A commit message is attached to that change — not the code itself. Accordingly, when you write a commit message you are writing it as if it’s about to be applied, rather than about what you just did.
Use consistent terminology: after many years of working with a project, or even many projects, it’s sometimes hard to track what a developer meant with a word in one case compared with another. For example, “administrator” may mean developer, project manager, project owner, the staff working on the project or special users. Settling on canonical terminology makes it much easier to understand changes over time, as well as search the repository.
Use a standard markup format: Whether it’s Markdown, MediaWiki, reStructuredText or something else, it’s useful if a standard markup format is used in git commits. While it’s unlikely to be rendered, it provides guidelines on how to structure lists, headings and so on, which makes it clear how the content should be written.
Provide as much context as you can: It’s super hard to understand what was going through a colleague’s mind (or even your own) 6 months after the code has been committed. Providing the context allows understanding of why the code was changed, not simply how.
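The 72-character guideline is tedious to apply by hand. As a rough illustration, the Python sketch below re-wraps a commit body using the standard library's `textwrap` module. The function name `wrap_body` is made up for this example, and the sketch assumes the body is plain prose paragraphs separated by blank lines: headings, lists and ASCII diagrams would be flattened by it, so treat it as a starting point rather than a finished tool.

```python
import textwrap

BODY_WIDTH = 72  # the width recommended in the guidelines above


def wrap_body(body: str, width: int = BODY_WIDTH) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` columns."""
    wrapped = []
    for paragraph in body.split("\n\n"):
        # Collapse existing line breaks so already-wrapped text re-flows cleanly.
        text = " ".join(paragraph.split())
        wrapped.append(
            textwrap.fill(
                text,
                width=width,
                break_long_words=False,  # keep long tokens such as URLs intact
                break_on_hyphens=False,
            )
        )
    return "\n\n".join(wrapped)
```

Because long words are never split, a single token longer than the width (a URL, for example) will still produce an over-long line.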

Though it’s not usually necessary, we can even go so far as including ASCII diagrams, lists or other useful structures in a commit message. Use whatever is required to convey the context behind the commit.

Additionally, the guidelines here apply to subsequent sections.

Stakeholder Impact

Another large section:

== Stakeholder Impact ==
=== Project Owner ===

This will allow the project owner to add new images to user profiles.
In this case the primary users of the application are employees, and
this will allow an easier publication of the welcome packet sent to the
team about new employees to ensure the employee transition is easier.

Additionally, the administrator will have a picture associated with all
users; useful for company directories or the like.

=== Users ===
By having a picture uploaded on their behalf before they start using the
system users are not required to immediately learn an unfamiliar system
to upload their own image prior to the dispatch of the welcome packet.
This allows the welcome packet to be sent earlier, and a smoother
transition into employment.

The stakeholder impact allows us to both mentally self-check and restate the intended goals of the work. By writing up the impact on the people who are associated with this work, we clearly describe what we intend will be the outcome once the changes are merged as well as to whom and why the changes matter.

Some tips for this section are:

List all stakeholders before writing notes: By listing all those involved in a project before writing how our changes will affect them, we ensure that we do not skip those who might not occur to us on first thought, and spell out the implications for those users.
Restate the goals of the work in the context of the stakeholder: Too often it’s easy to get lost in the implementation of the work rather than the impetus that started it. I have adjusted more than one commit as I have realised I forgot or misunderstood something as I was committing it.
Omit stakeholders you deliberately haven’t considered: Sometimes, changes simply don’t concern a given stakeholder. Project owners often don’t care about server configuration changes or instrumentation improvements — but developers do. In omitting them it’s clearly communicated they’re not the intended audience for the change.

Design Notes

== Design Notes ==
=== Sanitisation by image copy ===
In order to prevent any inadvertent vulnerabilities (such as the
embedding of PHP code in EXIF metadata) the image is not stored in its
original form, but rather put through a converter to drop EXIF or other
steganographically stored data.

When doing any sort of development work, we make tradeoffs between various factors that we are implementing. However, these tradeoffs are not visible to people who are reviewing our code, either during a code review or simply when trying to understand the code at a future date.

By explicitly stating these tradeoffs, we add additional information that may help future developers as they revisit this code, or try and write other systems that are dependent on this system.

Some tips for this are:

Answer questions in design notes: Whether in code review, chat or any other tooling, try and answer questions by adding them to the design notes, rather than simply replying inline. In this way, answers are recorded for all future developers rather than simply for that conversation.
Make notes during development: Sometimes, when development work is particularly in-depth, we forget the tradeoffs that we make as we write the code. Make notes during development about decisions you have made so they’re much easier to record in the commit.

Breaking Changes

BREAKING CHANGE: Modification of the interface for the User object
                 constructor

This section makes it clear when things have changed that other users may have to be aware of, either when accepting the patch or deciding on a version under which to release this software.
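Because the marker has a fixed form, tooling can pick it out of a commit message, for example when deciding on the next version number. The Python sketch below shows one way this could look. The function name `breaking_changes` is hypothetical, and the parsing rule (the marker starts a line, and indented lines that follow continue the description) is an assumption based on the layout of the example commit, not a documented standard.

```python
MARKER = "BREAKING CHANGE:"


def breaking_changes(message: str) -> list[str]:
    """Collect the descriptions that follow each BREAKING CHANGE marker."""
    found = []
    current = None  # parts of the description currently being read, if any
    for line in message.splitlines():
        if line.startswith(MARKER):
            current = [line[len(MARKER):].strip()]
            found.append(current)
        elif current is not None and line.startswith((" ", "\t")) and line.strip():
            # An indented, non-empty line continues the previous description.
            current.append(line.strip())
        else:
            current = None
    return [" ".join(part for part in parts if part) for parts in found]
```

Run against the example commit, this returns a single entry describing the change to the User object constructor.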

Making that easy

The above is super hard to remember. I would find it impossible to reliably implement it all the time. However, git allows templating of commit messages! In the template, we can add helpful pointers to remind us of this and other guidelines. For more information, see the following article:

https://medium.com/sitewards/git-tips-template-your-commit-messages-187d8a2051b8
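As a rough idea of what such a template could contain, the Python sketch below writes one to disk. The template text is my own guess based on the sections described in this post, not the one from the linked article, and `write_template` is a made-up helper name. Every line starts with `#` because git treats such lines as comments and, with its default settings, drops them from the final message when you commit through the editor.

```python
from pathlib import Path

# Assumed template content, mirroring the sections of the example commit.
TEMPLATE = """\

# Subject: short (~72 characters), specific, imperative.

# General background: why is this change needed?

# == Stakeholder Impact ==
# === <stakeholder> ===

# == Design Notes ==

# BREAKING CHANGE: <what changed for consumers, if anything>
"""


def write_template(path: Path) -> Path:
    """Write the commit message template to `path` and return the path."""
    path.write_text(TEMPLATE, encoding="utf-8")
    return path
```

Once the file exists, git can be pointed at it with `git config commit.template <path>`, after which the pointers appear in the editor for each new commit.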

In Summary

Git histories are an incredibly valuable tool. However, it’s sometimes not clear what separates a “good” commit message from a “bad” one. The above is a rough standard that I try and reach while developing, and one that I have found pays off within a few months.

Thanks

Tbaggery, whose guidelines I shamelessly rip off here.
Matthew Gamble, who originally educated me in great pain about these things.

Should you go to SRECon?

Andrew howden — Mon, 16 Oct 2023 15:30:50 GMT

On Friday, 13th October, I returned, exhausted, from SRECon EMEA 23. The conference, which I have now attended twice, is one of my favourite ways to spend time and one of the most critical professional periods of my career. However, of all the people I know and work with, only a tiny slice go to these conferences each year. This post is for people who do not go but who are interested in whether they can or should in future.

Let’s start by looking briefly at this conference.

SRECon

SRECon is:

a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale. SREcon strives to challenge both those new to the profession as well as those who have been involved in it for decades. — SRECon

It is a problem-oriented conference (as opposed to a technology-oriented or vendor-oriented conference). It thus tends to have a wide range of talks, approaching software operations from different sides. This year, there were deep dives into technology (e.g. eBPF), reliability, domain (e.g. a Telco), learning and development (e.g. Scale your Future) and organizational change (Tracing the journey of Distributed Tracing).

The immediate question is, is this a valuable conference? Moreover, is it sufficiently valuable to dedicate a week of work, flights and hotels? For this particular conference, I obviously think the answer is “yes”. But let’s dive into the value of a conference more generally and how you can maximize it while you’re there.

Heads Up 💡 One of the reasons that SRECon is such a good conference is because the organizers truly believe and commit to delivering on the USENIX Mission. The USENIX members and sponsors support this mission and make this conference (and the YouTube content) possible. If you appreciate this material, consider donating or becoming a member. I picked membership!

Value

Most of the “work” that seems to go around a conference is in scheduling and planning the talks that go into it or choosing which talks to attend. However, there is vastly more that you can do at conferences if you keep your eye out for the opportunity. For me, the most valuable aspects of a conference are:

Meeting People
Sharing Knowledge
Consuming Talks
Sightseeing

Let’s dig into these a little bit.

People

It’s hard to overstate quite how much technology is a people-driven industry. Often, a handful of people write technology that changes the landscape of software engineering. People meet at conferences, discover a shared need, and work across organizational boundaries to try and find a standard solution to that problem. People are actively scouting talent to solve deeply challenging engineering problems. Invariably, some (like me, soon [1]) use these conferences to try and find their next opportunity.

Working with people is slightly different from working with technology or a bureaucracy. People are much quicker to picture those they’ve seen or interacted with as partners, employees or managers. People weigh the ideas of others with greater heft if they’ve seen their peers react well to those ideas and will pay progressively more attention to them the more others also pay attention.

Attending a conference can be a massive boon to hiring or your efforts to find work simply because it exposes you to many people. At worst, the more you’re exposed, the more likely you’ll find someone looking for the particular kind of magic you’ll bring to an organization. At best, you’ll find new colleagues who bring the experience you sorely need within your organization, or perhaps you’ll find an organization that understands the value you can bring in a way your existing one struggles to.

There are ways in which you can work to maximize this value. These include:

Meet new people

When you are attending a conference in another country and in a new environment, it can be extremely tempting to stick with the people you brought along for the journey or those you meet regularly at these conferences.

My recommendation is to avoid this, if possible. The challenge here is that there are many ways to catch up with people you know already but few ways to get to know people you do not. A conference brings many people together with a shared interest, and is thus the perfect opportunity to meet these new people.

There’s no graceful way (in my experience) to go about this. Easily my favourite strategy is walking around with a coffee until you spot a gap in a group, and then asking at a convenient moment, “Do you mind if I crash this conversation?”. You will sometimes get uncomfortable looks, and you can move on, but far more often you’ll get “Of course! We’re talking about …”. Once you join, listen in, ask some questions, and if the opportunity arises, share your experience. Be a polite “conversational guest”.

Lastly, if the opportunity comes up toward the end of the day, invite people out for dinner or join a shared dinner. The conversations at dinners can go much deeper, and you can build a much more lasting connection over fish and chips and a glass of wine.

In Practice @ SRECon
At SRECon, I met quite a large number of people this way, and had some tremendous conversations with colleagues from Microsoft, Reddit, Slack, Pragmatic AI Consulting, Nobl9, Lightstep (now ServiceNow). Mostly I just listened, and asked questions, and tried to bounce the conversation to include other listeners. As a result, I now have multiple new connections for both my courses and when I’ll need work again 🤓

Grab LinkedIn

As you’re talking to different people, there’s a natural moment in which the conversation will end — going into a talk, for example. At that moment, if you’d like to catch up later, you can always ask for people’s LinkedIn. Something like, “Oh! We could connect on LinkedIn if you like?” is usually sufficient; people can always refuse this with something like “Ahh, sorry, I do not have mine handy and need to run!”.

Following the conference, you can go through LinkedIn and figure out if there’s anyone who made a connection that you’d like to reach out to later. LinkedIn helpfully sorts them by recency.

In Practice @ SRECon
I didn’t do this as much as I should have this year; I learned it from others. However, at the end I discovered the recency sort, and messaged all those who were kind enough to reach out. Quite a number of us live in Berlin! I fully intend to reach out and arrange a gathering.

Sharing Knowledge

To understand why sharing your perspective with the community is so valuable, it is first essential to understand one of the weirder dynamics of growing within our careers. Beyond a certain career level, hiring the person who will be best in class in a given position is far less essential than ensuring you do not hire the person who will be destructive in that position.

This is perfectly reasonable given some thought about the organization itself — an organization can definitely survive being less efficient. Still, it cannot survive if the organization is torn down and stops delivering value.

That raises the question: How do you identify the people likely to be “sufficiently valuable” with limited risk of being “destructive”? Well, you ask them for their perspectives on a series of complex tradeoffs and ensure their tradeoffs and decisions align with yours (or are at least logical and defensible).

That brings us back to sharing knowledge. The challenge of people looking to hire senior colleagues and those running the conference is similar: find people who have had to solve a challenging problem and ask them how they did that. The boon for you, as the speaker, is that by sharing your perspective across the audience (and on YouTube), you’re proactively sharing your tradeoffs and approach. In this, you’re exposing yourself to a wide range of people looking for that candidate and who might approach you based on that experience, even years later.

In parallel, it’s nice to give back to the engineering community as so many others are so kind with their knowledge. You might ignite an engineer’s curiosity and fundamentally transform their perspective! I have seen many talks that have changed mine. Additionally, nothing teaches you a topic like knowing you will have to present it and answer questions in front of up to a few hundred people.

Like before, there are a few ways to maximize your value here:

Submit to the Call for Papers

You must submit to the call for papers to talk at any conference. The reviewers decide whether or not to include your talk based on these proposals, so take them seriously.

What to submit is a challenging question, but I would look at the lessons you personally have learned over the past 12 months and take the opportunity to submit those. Don’t worry if someone else has already published material on this before — we are all at different points in our growth journey, and your version might be what someone else needs to hear at theirs.

You will face rejection in the majority of cases. There are too many enthusiastic speakers! But knowing that ahead of time is helpful, so you do not feel disheartened when the “no” comes back.

In Practice @ SRECon
This year, I submitted one full-length talk, which was rejected, and three lightning talks, two of which were rejected.
You can find the talk I did on stage duplicated on YouTube.

Practice

Let’s assume that you’re lucky enough to have a talk (or a lightning talk) accepted. Congratulations 🎉 You’ll then need to write that talk. Without going into detail, the process I usually follow is:

Write a post or a script
Practice reading that aloud, editing along the way.
Write slides

After that, in theory, you’ve got “everything you need” to present! However, presenting has one key factor that I (at least) find deeply challenging to overcome: sheer bloody panic. Presenting in front of people is terrifying; my legs shake, I speak too fast, and my mind goes blank. I rarely remember the talks themselves.

I’ve found only one solution to this over the years: Practice. Practice a lot. The reason to practice is to overcome that panic by “falling back” on the version of the talk you’ve wired into your brain. At best, you do not need it, but at worst, you can at least convey the talk you intended to convey without adjusting it to variables on the day. So, practice it, practice speaking it, practice it in front of your dog, practice it with a clicker and then practice it to a small community. Practice it until you’re hearing your voice in your sleep.

Lastly, when you’re at the conference, figure out where and how you’ll present. Figure out what mic you’ll be using, whether you can take notes, whether there’ll be a “confidence monitor”, and so on. Ideally, take a break to walk on stage in the empty room and test that your computer works with the HDMI. All of this minimizes the panic-inducing moments.

In Practice @ SRECon
This year, I only did a lightning talk and did exactly as I described above (though it was only 4 minutes). I was still over too quickly.
Last year, however, I did a talk with a colleague (Salomé Santos). We had practiced the talk exhaustively before the conference, but shortly beforehand Salomé needed to present remotely! In talking to the AV team, this meant that we would have no slides. Additionally, the whole talk setup was very unusual.
We adjusted to these constraints by falling back on what we’d practiced. While I felt nothing but panic during the talk, it all worked out beautifully for the conference itself. I later got the opportunity to speak at another, industry-internal conference as a result!

Talks

Lastly (for the conference itself), we have the talks. I put the talks last because, thanks to the USENIX mission, the members who support it and the conference sponsors, they’re available on YouTube afterwards. That’s not to say they are not valuable to attend on the day (they are), but given a choice between making a connection and viewing a talk, I’ll choose the connection and watch the content a bit later.

That said, talks are still super helpful. First, and with some irony, they’ll bring together other people, even within a conference, who are interested in a topic you’re also interested in. That makes it an excellent opportunity to connect with those people (or the speaker) to ask further questions or discuss their challenges.

Second, I’ve lost count of the times I’ve watched a talk and had my perspective fundamentally shifted. Talks on Human Factors and Safety Science prompted my most substantial recent shifts and shaped how I ran Embedded SRE for the last couple of years.

As always, there are ways to maximize your value from talks.

Seek Value

At a conference, you’re generally there for specific work reasons. That means that, to get the most value out of talks, you should pick the topics that are relevant to your job (or the job you want), that help you understand a challenge you are facing in a new way, or that provide you with peers to discuss it with.

This means not picking the talks because the speakers are compelling or because you are excited by the topic. This is the hardest challenge and one I regularly fail, as I deeply love digging into the Linux kernel and technology challenges around infrastructure. Still, my job focuses on enabling software engineers to own much more frequent, less complex operational challenges.

In Practice @ SRECon
At SRECon I joined some talks on topics that I am currently working on, such as Observability. It was super helpful, as I was struck that there’s still a substantial disconnect between how operations people see Observability versus how product engineers see it. This is encouraging for me, as this is the problem I set out to solve with the Practical Introduction to Observability. I also learned quite a bit more about Machine Learning observability, and made a connection with an expert in this domain (Lina Weichbrodt).
However, I also failed here and joined talks on Chubby (and its history) as well as Sockmap in the kernel. In my own defense, there’s only so much I can focus without joy 😅

Thank the Speaker

Speakers are people too, and people who just went through an emotional wringer in front of many other people. You can help brighten their day by thanking them for their talk and pointing out a concrete insight you took away from it. You can ask if they (later) want feedback on it, or if you have only positive feedback, pass that on directly.

You’ll be surprised how many speakers are grateful for this! I know I sure am. You might not get answers immediately (or at all) — speakers tend to get bombarded with questions after a talk.

Sightseeing

Last on the list. Personally, the fundamental reason I started going to conferences was that I was too poor to travel independently. I figured out that the conference ticket is free if I speak, and I can get work to (at least) chip in for my flight under “employer branding”. It’s no additional cost for the employer to send you a few days earlier, so you can get your own hotel for the extra couple of days and look around.

My financial situation has somewhat improved, but some of my life's best travel experiences have been just before or just after a conference. My wife and I will go, explore the city for a little while and then I’ll head to the conference. We’ve been to a bunch of remarkable and unexpected places like this!

In Practice @ SRECon
My wife and I went a few days early, and did the Guinness tour, Viking bus and quite a few dinners and afternoon cocktails. It was great fun! After that, she went back and I went into “full conference mode”.
Right now, I have no employer, so there is no financial benefit to doing it this way. But we have the habit, and it’s a nice one to keep.

Challenges

So! We’ve talked extensively about the value of this conference, how to maximize it, and examples from this year’s SRECon. However, at this point, you might think, “This sounds all great and well, but I cannot do this because …”.

Let’s chat about some of the challenges that I (at least) faced:

Cost

The first and most substantial challenge is cost. SRECon this year cost me ~3,000 € all up. 1000 for the tickets (approx), another 1000 for the hotel, 150 for the flights (I live close), and a few hundred for food and incidentals. This is an enormous outlay of cash, no matter your financial position!

Fortunately, there are definitely ways in which you can limit how much you spend.

Learning & Development Budgets

In this post, we’ve talked extensively about the value of conferences, settling on the primary value being around people. For companies, the highest ROI for conferences is sending people for hiring rather than learning. Still, the shared understanding that conferences are educational means that we can usually use our “learning and development” budgets to attend.

Often, just the ticket will exhaust the budget or exceed it several times over. However, these budgets are often pooled between 30 people, and companies can inadvertently make them hard to reach. This means that if you ask toward the end of the year when the budget is essentially “doing nothing” and will shortly expire, it’s easy to argue to spend it all on a conference ticket.

Speaking

So far, we’ve discussed speaking as primarily valuable if you’re looking for either work yourself or some new colleagues to work with. However, one pragmatic upside of speaking is the free ticket.

Additionally, companies can look at speakers differently than they look at attendees. This means working with corporate communications to ensure your talk is aligned (which, if you’re in a large org, you need to do anyway); however, it also means there’s a separate and often much larger budget to tap for your flights and hotels.

Speaking is how I used to be able to afford to attend conferences!

Sharing

While not something you can do beforehand, it is much easier to convince management figures that a conference is worthwhile if there is clear evidence that attending it drove positive change within the organization. Notably, this change doesn’t have to come from you.

You can use what you learned at the conference to try and drive change through your organization simply by summarizing your discoveries for others to consume. That gives people an “editorial view” of the material and helps them discover relevant things specific to their context. This is illustrated in an exceptionally amusing way by Tiarnán de Burca in this Slack thread.

Writing this up is cheap and can be done the day after the conference!

In Practice @ SRECon
This post was inspired by a discussion within a chat on this very topic. Hopefully it helps people understand just how valuable these conferences can be!

Time

The conference takes a significant amount of time away from your professional work. If you are working in an organization with strict deadlines, this can be challenging. In terms of getting that time approved, the approach is similar to the cost issue — try to clarify the value of going and sharing the knowledge you gained when you return. If it is “ambassadorial” work, it is much easier for companies to pay for than if it is “learning”.

That said, at the outset, be clear that this is not “work time” — you will be entirely unavailable during this period. The value of the conference is in the people, and when and how to meet with them tends to be unpredictable. You should not plan work at this time.

Aside from that, plan the time away as you would a holiday!

Exhaustion

Lastly, you'll quickly discover a challenge unique to the conference itself: Exhaustion. Essentially, you’re heading away from home to an (often cheap) hotel, and you’re not going to have the local coffee or food shops that you’re used to. You’ll be joining peers from across the industry, trying to put on a witty or polite facade for 12 - 16 hours a day over multiple days.

During this period, it is critical to be kind to yourself. Find a way to get space if you become overwhelmed, ensure that you get enough sleep and limit your alcohol intake to a level that doesn’t interrupt your sleep (or interrupt your “ambassadorial work”).

Lastly, take care around illnesses such as the coronavirus. If you are sick, stay away. Brutal though that is, by joining anyway, you’re probably taking out a large chunk of the critical engineering staff across multiple companies for a couple of weeks. When you’re there, you should feel empowered to take measures for your safety, such as wearing a mask, frequently washing your hands or working with organizers to ensure the rooms are adequately ventilated.

In Practice @ SRECon
This year, the primary challenge I faced was food. I train a few hours daily at the moment, and no food means I quickly get “Hangry”. I quickly found a local bakery which does good coffee and sausage rolls, and I disappeared there when I got overwhelmed.
The hotel itself was far away, so I blew my budget on Ubers at Midnight. Money well spent — money is cheaper than sleep — and I’ll book a closer hotel next time.
Lastly, I skipped a few talks I had intended on joining when I wasn’t able to be my “present and jovial self”. They’ll be on YouTube anyway, and I saved myself for dinner or the break.

In Summary

The question asked at the outset — Should you go to SRECon — is hopefully answered. For me, at least, the answer is overwhelmingly “Yes”; so much so that I sponsored my ticket this year (as I am unemployed). Hopefully, this article helps you understand why it is so valuable, how to maximize that value and how you can overcome some of the organizational inertia required to attend.

Probably not till December or January. I want to focus on getting this Practical Introduction to Observability done.

Suddenly, you’re in charge. How do you create direction?

Andrew howden — Mon, 04 Sep 2023 06:00:11 GMT

A story about how I've learned to use strategy as a common tool

Suddenly, you're in charge. Now what?

Andrew howden — Fri, 01 Sep 2023 14:52:01 GMT

This one is a pretty long read. You can find it broken up into chunks on YouTube! More of a listener? I’ll upload that shortly.

I’ve found myself suddenly responsible for a large group of technical people a few times. I’ve done this in preparation for CyberWeek, driving reliability projects across the checkout and in the formation of my team. I want to talk today about some of the tools and techniques I use to create a shared understanding, sense of purpose and direction in the people I am working with to deliver … something or other.

The sharpest, hardest lesson I learned was with the creation of Embedded SRE. In 2019, we hired four engineers with a broad mandate: “Improve the reliability of the transactional experience”. We’d be four people working with over 160 engineers, managers, technical experts and other contributors. We somehow had to come to terms with this and develop some semi-reasonable plan to make that area more reliable. For the first five months, we essentially “spun our wheels” — we did some operational work, but nothing substantial.

Fortunately, the engineering director in this domain, Andrei Gherasim, came to my rescue. He and I collaborated on a document called “Mission and Strategy 2022”. This document covered the scope, purpose, engagement model and broad deliverables my team would deliver over the next nine months. We kept reviewing it until we had it in good shape, after which we started sharing it with the engineers, directors and VPs my team would ultimately engage with.

That document changed the course of my team. Suddenly, rather than needing to negotiate each commitment individually with the stakeholders, we had a transparent contribution model and a path to choose when (and when not) to intervene. We had expectations to fulfil ourselves, and leadership within each domain could fold us into their operational plan. That strategy worked beautifully — for about a month or so. Then, a new operational challenge popped up, and we suddenly had too much to deliver on. We half-delivered a bunch of stuff.

2023 rolled around, and we did a review of that strategy. We identified what assumptions we had initially that didn’t bear out and what underlying theory we had that turned out to be wrong. It was a hard one to go through. I wrote up a new strategy that factored in our challenges in 2022. This strategy was more successful than the previous one; we made systems substantially more reliable. The process had room for improvement, but this structured approach was vital for my learning and the team's evolution.

Recently, I’ve been writing up another strategy for the course. It has helped me shape my approach, and I’ve iterated on it quite a few times while it's still on paper. Given this, I thought I’d share my general approach to disambiguating the unknown based on the last couple of years of figuring this out for my commitments. Who knows, maybe it’ll save you months of stress as it did me!

The Problem

To figure out how to write a strategy, we must first understand the problem we want to solve by writing a strategy. In his book “Good Strategy, Bad Strategy: The Difference and Why It Matters,” Richard Rumelt squares this away on page #1:

“The core of strategy work is always the same: discovering the critical factors in a situation and designing a way of coordinating and focusing actions to deal with those factors”

In my experience, an organisation (whether it’s a team, department or whole company) is organised around a small set of fundamental problems. Substack lets independent writers and podcasters directly publish to their audience and get paid through subscribers. Netflix enables people to watch the latest TV immediately on their devices. In the upcoming Site Reliability Engineering courses, I want to empower engineers to run software reliably through a reference approach to software operations.

There are usually many, many different ways to solve a problem. A large subset of these will get the job done at some point, and a smaller subset will be as efficient — if we execute that idea exclusively. To solve a problem, everyone needs to rally behind a single approach. The correct method for any organisation depends on its strengths, weaknesses, capabilities and challenges. So, the next step is to figure those out!

As a necessary caveat, I’m still learning to write “strategies” and lead people. It’s a fun adventure, but some positions might surprise you if you’re a more experienced leader. That’s fine — just correct me in the comments, and I’ll have learned more!

SWOT

In 1965, three colleagues (Robert, Otis and Arnold) designed SWOT to appraise an organisation's strengths, weaknesses, opportunities and threats. While it is a little older, it is still handy for inventorying what we need to consider when crafting our strategy. It is also a fantastic way to open our strategic planning to a larger community, sourcing perspectives from diverse people across job roles.

A SWOT analysis consists of a 2x2 grid. In this grid, we label a section for each component:

Strengths: Characteristics of the organisation that give it an advantage and can be leveraged
Weaknesses: Characteristics of the organisation that give it a disadvantage and need to be overcome
Opportunities: Something in the environment that we can exploit for an advantage
Threats: Something in the environment that poses a risk to the organisation

For solo projects, a landscape document is acceptable; for collaborative work, a whiteboard (or virtual whiteboard) to which “sticky notes” can be attached works well. I have more valuable insights when doing something other than my everyday work (such as cleaning the house or walking to the shops), so I stick it somewhere accessible and add to it over a few days.

Let’s take, for example, my current goal to enable engineers to run software reliably through a reference approach to software operations. I have some strengths, such as extensive experience running software and helping 160 teams at a large eCommerce company do the same. I have a small social media following, access to multiple engineering communities, relationships with many engineering leaders in this domain across Berlin and some experience creating content. I also have some weaknesses, such as challenges focusing on a topic for an extended time, inexperience in producing educational content, no experience in marketing or monetising this content and the isolation that comes with working solo on a project. There are opportunities to make the course material as helpful as possible through collaboration with peers in industry and at conferences, and to determine whether the problems I see running software are as substantial as I imagine. Lastly, there are threats to this project, including a limited time period and the parallel requirement to find a new job to start in December or January.

With the SWOT analysis, we have a good index of the organisational capabilities. This is great! The next step is to start thinking about the actual strategy.

💡 You can try this in your workplace. You can do it alone or with a team; take a moment and figure out what strengths, weaknesses, opportunities, and threats exist within your organisation. Once you’ve written this up, overlay this with your current strategy or the work you’ve been doing over the last three months. Are you maximising your strengths? Overcoming your weaknesses? Are there opportunities in the next weeks you can take advantage of? Challenges you should be preparing yourself for?

Story #2: Site Reliability Engineering

I’ve seen a lot of different “strategy documents”. They can vary widely in their ability to empower an organisation to solve their core problem, but one example sticks out in my mind: The formation of the Site Reliability Engineering department. I want to share this story as it was especially pivotal for my understanding, and I want you to understand the power of a good strategy.

In 2019, the organisation at which I was working did a reorganisation of central technology teams. It brought a series of teams together, including the “incident management” team, the “logging” team, the “SRE enablement” team and the “visibility” team, all under a leader called “Luis Mineiro”. The reorganisation made sense — the teams were all oriented around ensuring engineers could understand and operate their software. The teams had worked together in the past but had very different ideas as to what the future for their teams should look like and little notion as to how they should work together for a shared purpose.

Enter: “The Site Reliability Engineering Strategy 2020”. Luis wrote this document in collaboration with the engineering managers of each of those teams and created a shared understanding of the vision and purpose of the department. It rallied the teams together to make their work substantially more cohesive and clarify what the department would do and what it would not. Consequently, the department was able to ship remarkable changes quickly, restructuring incident response, rebuilding the time-series system and delivering a new model for thinking about time-series data.

The question that didn’t occur to me until I created my team was why this strategy was so successful. Why did engineering managers buy in so heavily? Where did it enable decisions, and how? Unfortunately, I subsequently left the company and cannot read it to be sure!

Earlier, we outlined the fundamental problem a strategy needs to solve: discovering the critical factors in a situation and designing a way of focusing actions to deal with those factors. There are a few things we need to have to complete that design. Let’s get out a document and start making notes!

The Problem

The first is a shared understanding of the problem. It might be surprising to think about this — after all, we’re all here for the same purpose! However, people can have different perspectives on what we’re trying to achieve and why. Given this, the first thing to do is to summarise the problem we’ll be working to solve: the customers who have this problem, what that problem is, how it affects those customers and what the scope of the problem we’re considering is.

Let’s stick with the course example. We might start with, “We need to teach engineers site reliability engineering”. That feels like a problem, but in fact, it is only the absence of the solution we intend to build. Which engineers have this problem? How many of them are there? What is their experience? Why does this work even matter? Let’s rephrase the problem a little. “All software companies need to ensure an acceptable level of reliability, and there are effective mechanisms to define that acceptable level and make tradeoffs to ensure that it is met. However, these mechanisms are not widely known, and companies must discover them independently. Because engineers are developing these approaches in parallel, we end up with competing approaches that are inefficient or a solution that comes at the cost of a greater business outcome.”

Now, we have a clearer picture of why our work matters and a more extensive range to draw solutions from. Courses are one approach, but so might be building a product to commodify knowledge or defining a standard organisations should adopt. It also defines what we will not solve — we will not solve the definition of business requirements, analytics or other software requirements, just the reliability tradeoff.

Success Criteria

As a corollary to the problem, we need to know how we will measure progress against this problem. The measure should be something that does not limit the potential solutions and that tolerates changes in approach as we figure out a more efficient way to solve our problem.

Let’s stick with the course example. Our fundamental goal is to help organisations improve reliability, so let's try and measure that! Unfortunately, we are unlikely to maintain access to organisations' data directly, but we can certainly ask them three months after completing the course whether their reliability has improved. In addition, as we intend to teach learners, we can ask them three months after completing this material whether or not it helped.

Once we have a way of validating our strategy, we need to set up a routine to check whether or not we’re delivering what we need to. A good default is to check in on these monthly, comparing the current position with the past month and, if possible, the same position last year.
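
The monthly check-in described above can be sketched in Go. This is a minimal illustration rather than a prescribed tool: the Review type, its field names and the survey numbers are all hypothetical, and it assumes a value for the same month last year is available.

```go
package main

import "fmt"

// Review holds one success metric at three points in time.
// The type and its fields are hypothetical, for illustration only.
type Review struct {
	Current   float64 // value at this month's check-in
	LastMonth float64 // value at the previous check-in
	LastYear  float64 // value for the same month last year
}

// Deltas returns the change against last month and against the
// same month last year.
func (r Review) Deltas() (monthOverMonth, yearOverYear float64) {
	return r.Current - r.LastMonth, r.Current - r.LastYear
}

func main() {
	// Made-up survey results: the share of learners who said the
	// material helped, three months after completing it.
	r := Review{Current: 0.75, LastMonth: 0.5, LastYear: 0.25}
	mom, yoy := r.Deltas()
	fmt.Printf("vs last month: %+.2f, vs last year: %+.2f\n", mom, yoy)
}
```

Keeping both comparisons side by side makes it easier to tell a one-month blip from a longer trend.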

💡 Try this in your workplace. Write up your understanding of the problem your organisation is designed to solve and how you’re validating that it's solved. If you have a patient colleague, ask them to write theirs down in parallel. It doesn’t have to be long — 100 words is great! Then, compare. Did they overlap? What were the differences? What happens when you share it with your team?

Organisational Theories

So far, we have an analysis of our capabilities, a clear problem definition and the criteria by which we’re going to measure if we’re making progress. The next step is to write up our theory as to what might be contributing to the problem, and what might be effective interventions to try and address it. It's the basis on which we’ll make “strategic decisions” later.

Sticking with our course example, the problem we want to solve is that organisations need to maintain an amount of reliability, but there’s no common path to do that. Engineers need to discover the solutions to this problem on their own. Some theories that might contribute to this include:

Giving engineers a good operations model will improve reliability. If the engineers know how to engineer reliable systems, they’ll choose to do so.
Multiple colleagues doing the same training are more likely than one colleague to drive change. Team members collaborate on new ideas much more efficiently if they’ve learned that new idea at the same time, and can collectively discuss how to implement it.
Providing a shared reliability model means a path to shared evolution. Rather than every engineer discovering operations for themselves, we can give them all a “standard reference”. Even if they disagree, they have something explicit to disagree with.

These theories will either be correct or not. But by making them explicit, we can be clear about why we’re making the strategic decisions that we will. Additionally, we can work to validate these theories as we execute the strategy, abandoning the theories that do not match reality (or abandoning the strategy if the theories are sufficiently incorrect).

We need to review these theories for evidence; a good cadence for this is once every six months.

Capabilities & Constraints

We have our problem, measurement and organisational theories. We now have as much opportunity to be creative as we can be! Unfortunately, while it is fun to be boundlessly creative, we are all bound by the capabilities we have access to or the constraints we need to meet — the same ones from the SWOT analysis earlier.

We should document them in the strategy unless the SWOT document is otherwise especially legible. We do not have to elaborate on them. We can simply restate them from the SWOT:

Capability: An instructor with over ten years of experience in writing software and maintaining and operating that software in a production environment
Capability: A series of social media profiles that can provide snacks of this content, with the hope people will choose to purchase the whole material.
Constraint: The time to write the courses is limited, with only 12 weeks (between September 1st and November 30th) available.

This is useful to reevaluate as we evaluate the rest of our strategy in 6, 12 or 24 months. If we develop new capabilities or meet new constraints, we can factor them in and adjust our strategic choices.

Tradeoffs

Whew! We’ve done much investigative work so far. We’ve done an analysis of our strengths and weaknesses, defined and aligned on the problem and how we’ll measure it and created some theories around why that problem exists. We’ve been clear about our opportunities and constraints. It’s time to make some decisions.

Let’s stick with our course example. We have a hard constraint of time. We’re thus making a broad choice as to how to spend that time. Our first tradeoff might be:

We will focus on the delivery of limited material for rapid feedback with our community instead of publishing a large release at once. We will revise and expand content that appears especially helpful to learners.

In and of itself, the tradeoff is not a decision — rather, it's something that we should consider as we’re making other decisions. Other tradeoffs might be:

We will focus on providing value by creating compelling material, rather than on the delivery of that material itself. We will work actively with third-party providers where they provide a suitable mechanism to deliver.
We will focus on a limited subset of the material if that proves especially valuable rather than aim for material completeness.

I would expect tradeoffs to be much longer lived than any decisions that we make as part of a strategy. Tradeoffs only tend to be adjusted when the underlying conditions of the organisation change, such as the time horizon on which the organisation is focused or the core problem the organisation faces.

💡 Try this in your workplace. All organisations are making tradeoffs all the time. See if you can look over several major decisions leaders have made in your organisation. Is there a common theme? Can you write down this theme and validate it with your peers? Can you figure out whether that tradeoff is by design or accidental?

Constraining Choices

Hooray! We’re about to make some strategic choices — the most exciting part of any strategy. However, this is also easily the most challenging part of a strategy to execute well. The primary challenge is not the decisions themselves but figuring out how to layer decisions so that they give an organisation direction without constraining its creativity.

Let’s think first about a team of people. The value in dividing an organisation into “teams” is that each can make a series of decisions with some independence from other teams; that’s how the organisation can get a lot done quickly. Let’s think about an organisation of 100 people, organised into “two-pizza” teams of four to five people. There might be 15 delivery teams of 5 people and a manager. Those managers will need to be managed by an additional three managers (which we’ll call “department heads”), which, in turn, need another manager (“director”). We might have six senior technical team members (“staff engineers”) and one project manager.

That means we have at least 19 managers who need to make coordinated decisions across multiple layers of management, without counting any technical experts who inform these decisions or teams who are expected to own and deliver on them! There is a tradeoff between an organisation that is aligned and an organisation that is fast-moving. Our goal is to create a strategy that allows an organisation to be as aligned as necessary while moving as fast as necessary.
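
The arithmetic behind that figure can be checked with a small Go sketch. The Org type and its field names are hypothetical, introduced only to make the counting explicit; the numbers are the ones from the example above.

```go
package main

import "fmt"

// Org describes the example organisation. The type and its fields are
// hypothetical and exist only to make the counting explicit.
type Org struct {
	Teams            int // delivery teams
	EngineersPerTeam int // engineers in each team, excluding the manager
	DepartmentHeads  int // managers of the team managers
	Directors        int // managers of the department heads
	StaffEngineers   int // senior technical team members
	ProjectManagers  int
}

// Managers counts everyone who manages people: one manager per team,
// plus the department heads and directors above them.
func (o Org) Managers() int {
	return o.Teams + o.DepartmentHeads + o.Directors
}

// Headcount counts everyone in the organisation.
func (o Org) Headcount() int {
	return o.Teams*(o.EngineersPerTeam+1) +
		o.DepartmentHeads + o.Directors +
		o.StaffEngineers + o.ProjectManagers
}

func main() {
	org := Org{
		Teams:            15,
		EngineersPerTeam: 5,
		DepartmentHeads:  3,
		Directors:        1,
		StaffEngineers:   6,
		ProjectManagers:  1,
	}
	fmt.Println("managers:", org.Managers())   // 15 + 3 + 1 = 19
	fmt.Println("headcount:", org.Headcount()) // 90 + 3 + 1 + 6 + 1 = 101
}
```

With these numbers, the headcount lands at 101, close to the round 100 used in the example, and roughly one person in five is a manager who needs to be aligned.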

The trick is to make decisions that constrain choices just enough to ensure alignment but that leave maximal room for creativity in the next management layer. This can be extremely challenging and requires balancing freedom and accountability in the management layer.

For the example of the course, it’s a little easier — it’s just me writing these courses! As organisations get larger and larger, the balance gets harder and harder to strike. The course decisions include:

Provide multiple, small courses teaching Site Reliability Engineering. This enables us to validate our organisational theories with relatively minimal work and meets the tradeoff of delivering a limited amount of material for rapid feedback.
Provide learners connections with communities of practice. This enables us to validate the theory that multiple learners executing in parallel will be more effective change agents than a single learner — even if those learners are not in the same organisation.

That’s it! We can figure out anything more specific than that over the coming weeks.

💡 Try this in your workplace. See if you can figure out how your top-level leader balances autonomy and alignment in your management chain. At what layers are which decisions made? What are the controls that keep things aligned? Are there multiple layered strategies?

Review & FAQ

Gosh, wasn’t that a blast! We’ve now completed our strategy.

Unfortunately, the work around the strategy has only begun. A strategy is only ever solid if it prompts a shared understanding that you, as a leader, can use to drive organisational efficiency. The first step to doing this is getting it in front of people.

The first and counterintuitive step is to use yourself as the first reviewer. To start, put the strategy away. Hide the tab, put it in a drawer, or rename it to something opaque. Then, a week later, come back. Print it out, get a marker and start going through it. You’ll be surprised at how much your thinking has already evolved. Go through and correct it. Then, get a recorder, read the document aloud and play it back to yourself while going for a walk or similar. As before, you’ll notice things that you missed.

The next challenge is getting people to read it. Mostly, people will only read some way through a more extended strategy! Reduce your larger document to a 1-page strategy summary to get buy-in from everyone. Once you’ve written this, start sharing it for others to read. Share it with:

A close teammate
Your leader
Your team
Your stakeholders
Your organisation

As you share it with more and more people, you’ll start to find that common questions come up. You can answer these questions in a new section called “FAQ” at the bottom of the document. Common questions include:

Who are the customers of this strategy?
What gives you confidence in this strategy? What are the risks?

After you’ve distributed your strategy, call a meeting and invite whoever needs to approve the document. Allow your approvers one final read and ask for approval. Once you have it, you’re aligned. Time to start executing!

In Summary

Disambiguating the unknown can be extremely challenging. I struggled with it when my manager tasked me with driving the Embedded SRE team. I first saw it solved elegantly during the creation of the Site Reliability Engineering department. However, taking a structured approach to understanding the problem is very helpful.

SWOT allows us to inventory our organisation's capabilities. Then, we can go through and ensure we have a shared understanding of the problem and theory as to why it exists, list our capabilities and constraints, make tradeoffs and then define the constraining choices.

Once we’ve written it up, we can condense it into a one-pager and share it with our colleagues. We can then use that to walk our now clear path into the unknown.

I recommend you try it! At worst, you’ll learn a tonne.

Reading Lists

Andrew howden — Mon, 28 Aug 2023 15:49:52 GMT

💡 denotes a uniquely helpful read. In no particular order.

Books

Good Strategy Bad Strategy: The difference and why it matters — Richard Rumelt
The Principles of Product Development Flow: Second Generation Lean Product Development — Donald G. Reinertsen
High-Performance Web Sites — Steve Souders
💡 High Performance Browser Networking — Ilya Grigorik
💡 Drift into Failure — Sidney Dekker
Linux iptables Pocket Reference — Gregor N. Purdy
The Five Dysfunctions of a Team — Patrick Lencioni
Six Simple Rules — Yves Morieux, Peter Tollman
💡 Thinking in Systems — Donella Meadows
Systems Thinking for Social Change — David Peter Stroh
Building Microservices — Sam Newman
💡 Accelerate — Nicole Forsgren, Jez Humble and Gene Kim
The Phoenix Project — Gene Kim, Kevin Behr, George Spafford
Grit: The power of passion and perseverance — Angela Duckworth
💡 The Go Programming Language — Alan A. A. Donovan, Brian W. Kernighan
Infrastructure as Code — Kief Morris
Work Rules! — Laszlo Bock
How Google Works — Eric Schmidt & Jonathan Rosenberg
💡 Site Reliability Engineering — Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
The Manager's Path — Camille Fournier
The Fifth Discipline — Peter M. Senge
The Unicorn Project — Gene Kim
Attacking Network Protocols — James Forshaw
The Site Reliability Workbook — Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne
💡 Working Backwards: Insights, Stories, and Secrets from Inside Amazon — Colin Bryar, Bill Carr
Continuous Delivery — Jez Humble, David Farley
💡 Thinking in Bets — Annie Duke
The Culture Map — Erin Meyer
Useful to understand how different perspectives around the world can help implicitly shape understandings, and lead to quite different views on the same communication.
Scarcity — Sendhil Mullainathan, Eldar Shafir
Useful to provide good models of how resource-poor people (e.g. time, money, etc) bias themselves in a way that systematically leads to poor outcomes.

My Radical Idea: How to be a Site Reliability Engineer

Andrew howden — Sat, 26 Aug 2023 08:50:57 GMT

Listen now (8 mins) | I want to pass on whatever I've learned about SRE. I need your help to make sure it's useful

My radical idea: How to be a Site Reliability Engineer

Andrew howden — Wed, 23 Aug 2023 08:00:22 GMT

You can also find this on YouTube

The backstory

One of the more exciting parts of my job has always been sharing an amount of experience I’ve either been able to gain first-hand or composite together from blogs, research and other people’s insights. I do this quite a lot internally wherever I’m working, and occasionally try and share this with the broader community (e.g. SLOConf, SRECon, Increment and so on).

As Covid-19 hit, the opportunities to go to conferences and meet new colleagues in person were (rightfully) dramatically limited, and we waited patiently for the world to reopen. In the meantime, I started playing with video equipment. This led to sponsored videos for my org about being on-call, observability and distributed tracing. The videos seemed to help those new to these topics and became part of the standard Zalando onboarding process.

I got to do this a few times, but the videos were expensive to make — about 100 hours each. As you can imagine, there are many demands on engineers’ time and making videos was not one of them ;D Still, I felt like we were under-leveraging the innovation and capabilities of our engineering community. If we could better empower them with some of our senior community's hard-won knowledge, we could make our environment more productive, healthier and happier. It sat in the back of my mind for two years, an itch that never went away. I kept hacking on making prettier and prettier video content. While I am no YouTube sensation, I don’t think there’s much else I could do without focus.

The inflection point

My work in the intervening period has been tremendously satisfying. I believe that I and the peers I worked with substantially improved the customer experience and the quality of life of our engineering community. However, recently due to some organizational restructuring, I faced a series of choices — a moment of inflection. I could do what I’ve done in the years previously, or I could try and scratch that itch.

I talked to my wife, colleagues and peers, and after some challenging reflection, I decided to try and scratch that itch. I’m going to try and make an SRE course.

I start next week. I have between now (August 28th) and November 30th to deliver. My work with my previous employer has ended, leaving me with no focus except to execute this course. By December 1st, I must have found and started on a new opportunity.

There’s no other opportunity that will present itself like this one. Let’s see what we can do!

The material

The course aims to empower engineers in their career between a “software engineer” and a “staff engineer” to understand how to make a reliable customer experience without sacrificing the health and well-being of the engineers who strive to make that experience work. After doing the course, I would expect learners to be able to:

Know what "Site Reliability Engineering" is, the areas that it covers and the broad approach it takes
Be able to make software “observable”
Be able to deploy software to "the cloud"
Understand how to detect and respond to critical production issues confidently and collaboratively
Understand the different operational tradeoffs (money, on-call health, developer velocity and reliability)
Know different strategies to consult with product delivery teams

To do the course, you will need some software development experience. Where the course makes specific references to a language, I’ll be using Go. We’ll cover a broad range of topics, such as how:

Do systems break?
Can we design systems?
Do we validate our software before release?
Do we release our software safely?
Can we modify our software safely at runtime with configuration?
Do we understand what our system is doing in production?
Do we respond to production incidents? Learn from them afterwards?
Can we articulate and set guidance on operational tradeoffs?
Can we work with other teams specifically on reliability topics?

The goal is to cover as much of this material as I can within 8 hours of your time (or less), making sure you get as much value as possible. So far, I intend to publish it on Udemy for the company-friendly price of 199 €.

Making this a reality

I am genuinely excited about seeing if I can make this work, but there’s still an enormous amount of challenge ahead. I’m going to need some help to make this a success.

There are a few ways you could help if you so desire:

Feedback

You’ll notice that recently I’ve been posting material on this blog, on YouTube, on LinkedIn, and so forth. This is practice material — I am learning what I must do to make a compelling course. I will continue to do this over the next 12 weeks (and hopefully beyond), but I need feedback from you most here. Things like:

Is it valuable content for you? If yes, why? If not, why not?
Are you able to follow my line of thinking? Are there ways I could make this more straightforward for you to understand?
Is the way it’s presented useful? Do you prefer written content, video content, shorts or anything else I haven’t thought of?
Is the content accurate and comprehensive? Am I missing anything?

You should comment on the material directly wherever you see it. You should also be very direct. You will be doing me a favour; the internet is an unforgiving place, and I need your perspectives early to help me evolve ♥

Engagement

If you find the content interesting, let’s chat about it! Ask questions, raise comments, disagree or otherwise engage with the material.

This solves two substantial problems:

It will help me connect with you. As I make this transition, these connections are just vital for me.
I will understand you more. The more I understand you, the better I can craft helpful content.

Buy in

Lastly, if you find it useful, subscribe. When the course comes out, buy it. Ultimately, this will determine whether I will make more of this material or focus on other areas.

As a bonus, I will publish a coupon code for subscribers of this mailing list that will give you a substantial discount in case your org does not give this much love to learning and development.

Let’s get to it

This is an urge I’ve had for a good couple of years now. The moment has come when I have the opportunity to pursue it, and while it is terrifying, I am committed to delivering on it. The material will cover everything designed to help you engineer reliable systems or drive reliability commitments within your organization.

I won’t be able to do it alone, and I will frequently stumble along the way. Your help will be vital to making this a success. You can do this simply by giving me feedback, engaging with the material or buying in when you see the opportunity.

The project ahead feels like an enormous mountain to scale. That said, the mountain gets no smaller, and we get nowhere by thinking about how to scale it. Let’s get to it!

Incident Commander

Andrew howden — Sun, 20 Aug 2023 14:29:31 GMT

Listen now | What an incident commander is, what they need to know, what they do and how to prepare for the role.

Incident Commander

Andrew howden — Mon, 14 Aug 2023 17:03:18 GMT

💡 You can also listen to this post via a podcast or find it on YouTube with some handy visuals.

I’ve been part of the “incident commander” rotation at a large, multi-national European eCommerce fashion company for the last couple of years. Through this, and my time as a Site Reliability Engineer, I have been exposed to numerous incidents, from the occasional “… this is not an incident. We can just deal with this tomorrow” to the “we need the CTO on the phone right now”. It’s a hugely exciting role that I would encourage peers to consider as part of their personal development.

To that end, let’s go into what an “Incident Commander” is, their roles and responsibilities, how to prepare for the role, and what the role looks like in practice.

Incident Response

Rather than dig into the incident response details, I’d encourage you to read the articles “I hereby declare this an incident” and “Help! I’m now on call!”. They go through this in substantially more detail.

For incident commanders, a few things are worth calling out especially. These include Severity, Stakeholders and Maturity.

Severity

The severities that we see as an incident commander can vary pretty widely; everything from “someone deployed something bad and should roll it back” to “this is going to appear on the news tomorrow, and we’d better wake up the crisis management team to prepare the required press releases”. We’re usually the most senior stakeholder in the room — indeed, the most empowered unless a VP or CTO turns up — and we thus need to make a set of decisions factoring in the current level of impact, the risk of future impact and the likelihood of recovery.

To that end, we need to understand the significance of our issue quickly.

Stakeholders

On paper, only a few people are involved in incident response — the responder fixing the issue, the scribe making notes and you as the incident commander.

In practice, as the issue gets more significant, many stakeholders turn up. We have:

Management who are looking for clarity on the current issue, who want to make it clear that it deserves the focus of everyone required, and who offer help and coordination.
Sideline Experts who are watching the incident on a public channel or otherwise doing off-book investigation and who have learned something they believe is beneficial.
Product Owners who are suddenly powerless to do their work and are watching their hard-earned customer trust wash away.
Non Experts who are trying to help by supplying additional information or reporting the issue in parallel without understanding the existing context.

Each of these stakeholders requires a level of coordination to ensure they either can contribute to or at least do not hinder the response.

Maturity

In principle, an organisation optimizes to ensure things are available. They usually have a group of people tasked with this responsibility and allocate those people the time and effort required to practice and improve their response.

In practice, an organization comprises a diverse set of people, each of whom has a unique set of expertise and capabilities. Sometimes those capabilities are less developed than we might expect, given a person’s assigned responsibility as an on-call engineer.

They might:

Freeze. Due to an incident's sheer stress, a responder might not think clearly and either freeze entirely or get stuck hyperfocused on an irrelevant part of the investigation. This is understandably exaggerated if that responder also triggered the incident.
Offload Responsibility. A responder might review their systems, conclude nothing is wrong, and thus decide it’s someone else’s problem. They’ll involve that other team and then leave. This is understandable but unacceptable during response: information usually evolves too quickly to step away before the issue is repaired.
Not know. A responder might not know how to debug a system, where a system’s telemetry is or what the consequence of a given decision might be for the business.

These are all surprising during the process of a response, but over a long enough time happen predictably. It’s part of an incident commander’s mandate to try and help responders overcome these challenges.

Role

So, an incident has happened, and the incident commander has been paged. They’re reading the chat and understand the severity of an incident, the stakeholders that are appearing and the maturity of those involved. What’s their actual role in this mess?

An incident commander is tasked with structuring the response to minimise the impact on the business. The practicalities of this include:

Structuring Communication

An incident is a high-stress moment with many people coordinating around a single issue. Frequently, people start to take shortcuts with their communication, and messages become things like:

A: flop system is broke
B: broke?
A: yeah broke it doesn’t work anymore
C: A what does the graph say
B: it works for me
A: no its broke

Or similarly vague requests. With stressed people, this quickly deteriorates and can end up with people either getting “snippy” with each other, making demands or working in parallel and not talking. As more and more people join the chat, it becomes unworkable, and we lose valuable time just understanding each other rather than analyzing the problem to intervene.

As incident commanders, we need to intervene and clearly set communication expectations. Often we can do this simply by reading through the thread and articulating everything that’s happened so far in a status message. For example,

IC: So far, I understand the customer experience impact is

* flop system

Our underlying hypotheses for this are:

* we’re out of disk space (being verified by A)
* we’re overloading CPU (being verified by B)

at this time, our estimated time to recovery could be up to 1hr and is very unpredictable. Please add a ➕ if this is correct

Such a message clearly sets the expectation for how communication should look and consolidates the updates that were otherwise arriving piecemeal from different stakeholders.

This can quiet the thread until there’s a discovery or intervention. Whenever there is, it might still appear piecemeal:

A: disk is full

We can tidy this up by both setting the expectation about comms and clarifying the actual data:

IC: A, I am struggling to understand your message as it does not contain the context I need. Do you mean that the disk with the ID a-service-disk has zero free space left according to the graph “https://…”?
A: Yep sorry will add this in future

After which the communication usually improves.
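
As a minimal sketch, the status message shown earlier in this section could be templated so that every update has the same shape. The function name status_update and its argument order are assumptions made for illustration; they are not part of any incident tooling.

```shell
#!/bin/sh
# Hypothetical helper: render an incident status update in a fixed shape.
# Usage: status_update IMPACT ETA HYPOTHESIS...
status_update() {
    impact=$1
    eta=$2
    shift 2

    printf 'IC: So far, I understand the customer experience impact is\n\n'
    printf '* %s\n\n' "$impact"
    printf 'Our underlying hypotheses for this are:\n\n'
    # One bullet per hypothesis, including who is verifying it.
    for hypothesis in "$@"; do
        printf '* %s\n' "$hypothesis"
    done
    printf '\nEstimated time to recovery: %s. Please add a + if this is correct.\n' "$eta"
}

status_update "flop system" "up to 1hr, very unpredictable" \
    "we are out of disk space (being verified by A)" \
    "we are overloading CPU (being verified by B)"
```

Keeping the impact, the hypotheses with their owners, and the recovery estimate in the same order each time means people joining the thread late can find what they need without asking.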

Making Decisions

As the incident response goes on, there often comes a point when a responder can intervene, but that intervention comes with some risk. For example, they might need to disable a broken payment method even though it is used by 50% of users in that country, or make many products unavailable while data is repaired. The responder might be able to supply some information about the consequences but does not feel sufficiently empowered to make that call.

An incident commander is empowered to make these decisions. They need to balance the impact on the customer experience, the customer experience given the alternative and the impact on the business. They need to assess whether other, better-informed stakeholders are available to make the decision or whether it is better to make it sooner. Ultimately, they need to provide a path forward for the responder.

There are no correct solutions here, but commanders are expected to know enough to make reasonable decisions the majority of the time.

Managing Stakeholders

As mentioned earlier, many stakeholders will turn up during a response, both those who are supposed to and those who are trying to be helpful. The problem is that those stakeholders can start to distract the people actually repairing the issue, either with requests for clarification (usually management) or helpful ideas about restoring the service (other engineers).

To repair a system — especially if the underlying failure is non-intuitive — the responders need to be able to focus. This means that while the messages addressed to responders are well-intentioned, they are also degrading the response.

As incident commanders, we need to take away as much of the burden of communication from responders as possible and work to manage the expectations of those trying to contribute with only the information we have. We can encourage people to send technical insights they have to us privately to keep them out of the main thread, and if we catch one that is especially useful, ask the contributor to surface it in the main thread and join the response.

We also need to communicate proactively with non-technical stakeholders, translating the technical findings of the incident so far into information that these stakeholders can take action on. This could be as simple as clearly conveying the current customer experience, giving an estimation of time to recovery or simply reassuring people that the response is happening and just to be patient.

Manage large incidents

Occasionally some incidents are so substantial they require multiple parallel efforts to restore system functionality. This could be a whole team executing a repetitive task (e.g. repairing DNS records) or many different teams figuring out how to repair their service (e.g. shifting many different services out of a broken availability zone).

During such a response, it is up to the incident commander to survey the available people and then assign specific people or teams the tasks they should complete. The incident commander should also set up an ad-hoc management process (often just a Google Sheet or Doc) to keep track of these tasks and identify any people who are stuck or need intervention.

This can take substantial mental effort. The critical thing in these cases is to maintain an overview of the current impact, hypotheses, interventions and people doing tasks. This means that some of the incident commander's responsibility should be delegated to other incident commanders or senior colleagues. For example, managing communications, communicating with management or updating sheets can all be delegated to someone else while the incident commander maintains an overview.

Kick-off crisis management

There are occasions when a given technical issue has such substantial ramifications that it will mean the company either loses customer trust or money or appears in the news. All of this is beyond the usual remit of incident response, but there are parts of the company that are designed to cope with these challenges.

An incident commander should kick off these crisis management teams and empower them with what they need to control the narrative around what is happening with the system. Occasionally, these teams will come back with requests (e.g. “Can we put up a notice here” or “Can we identify customers to apologize and send a voucher”); the commander has to prioritize and implement these requests against the backdrop of the incident.

Off the beaten path

Lastly, there are occasions when something happens that simply no one planned for. Still, it is sufficiently urgent that it bypasses all software deliverables and is worth as much effort as we can bring to bear on it. One recent example is security issues that affect many systems (e.g. log4j) and are outside the normal security response process.

Incident Commanders are frequently involved in these incidents for their experience managing such urgent tasks, relationships with responders, and credibility within the community. While the incident might not follow the traditional process, the commander can still help deliver a critical and immediate business requirement.

Preparation

What characterises the incident commander role (at least, in my experience) is that it tends to be involved in issues that are outside the normal processes. This makes the role challenging, as it would be “preparing for the unexpected”, and the unexpected is … well, unexpected.

That said, there are things that incident commanders should have:

Practice

An incident response process is generally designed to empower responders to prioritize, communicate and respond to production issues, but it also usually doesn’t say much about the response. The response depends on the specific technical stack, the business impact, the stakeholders and other organizational contexts.

The only way to understand what responders are going through is to be a responder. Because incident commanders are usually only called for issues more critical than the “average” incident, they should have experience with similarly critical incidents.

This allows them to build empathy with responders as they go through their most challenging professional experience so far and build a toolkit to understand the impact, manage these responders, or otherwise coordinate incidents.

Once they have that experience, they should join the incident commander rotation in the shadow role to get perspective on the other side of that responsibility and build the required business context and relationships.

A broad business understanding

As mentioned earlier, incident commanders will tend to be involved in more substantial issues, and occasionally, they need to make time-sensitive decisions that can broadly impact the customer experience.

They can only do that if they also understand the business, the customer experience, and the software architecture and can make a judgement call as to which is the better technical path to follow.

Gaining a better understanding of the business means dedicating time to learning it: reading each domain's top-level and significant strategies, and then understanding the approximate software architecture. It will evolve, but a business’s core deliverables are stable over many years.

Good stakeholder relationships

Lastly, as the incident commander interacts with a broad range of stakeholders across a broad range of job roles, they must communicate in ways that suit each stakeholder. Additionally, they have (hopefully) established this communication outside the bounds of the incident itself. Good relationships allow much of the communication to be implicit, and stakeholders are more likely to trust the incident commander’s judgement until it can be reviewed.

As a pro tip, communication does not necessarily have to be two-way — being seen as a visible, technical expert with good communication skills makes life easier.

Debugging Skills

Naively, an incident commander is also expected to have excellent debugging skills and be able to reason through the behaviour of a broad range of systems.

If you are debugging an incident yourself during a response, the incident is in deep trouble.

That’s not to say it doesn’t happen — it does, every so often — but an incident commander's core value is in enabling others' excellent work rather than being a technical expert.

Where they are needed, skills in reasoning through the system from first principles (e.g. as a system of constrained resources, as a series of queues or in its interactions with the kernel) are most helpful in understanding a broad range of systems, runtimes and architectures. After that, reading the internals of different runtimes (Java, Scala, Go, Node and so on) is useful, as they frequently have excellent debugging capabilities that responders didn’t know about because they’d never needed to go that deep.

In the end, it’s all just bits flowing through the network, getting run on several cores and, occasionally, written to disk.

In Summary

Incident Commander is an exciting role. It exists primarily to catalyze the response rather than to contribute significantly to the response itself. This means they must understand the severity, stakeholders and maturity of responders and work by structuring communication, making decisions, managing stakeholders, managing significant incidents and handling anything “off the beaten path”.

There’s no secret to preparation, save taking the time to practice — especially in incidents of the same severity that incident commanders are regularly involved in. This practice and developing a broad business understanding, stakeholder relationships and debugging skills all go a long way to making an effective incident commander.

Hopefully, this reassures you that incident commanders are, in fact, human and that, with time and effort, you, too, can join their ranks!

The elegant, but unspoken solution

Andrew howden — Mon, 07 Aug 2023 18:36:51 GMT

So far, throughout my career, I’ve been a software engineer, systems engineer, site reliability engineer, principal engineer and finally, engineering manager. I’ve built new user interfaces, checkouts, ansible definitions, and Kubernetes clusters and, more recently, been embedded in a large organisation in a team dedicated to improving the reliability of the checkout experience.

This sounds very impressive, but let me assure you: it’s a journey that’s been absolutely littered with failure. As a software engineer, I broke whole shops; as a systems engineer, I deleted critical data. As a site reliability engineer in a major production issue, I’ve been wrong, with strong insistence, while millions of dollars slipped away. I’ve caught Bitcoin miners running around my (non-production) systems, and I’ve soured relationships with colleagues. All of this is to say I’ve struggled quite a lot. As a Site Reliability Engineer, I’ve also been exposed to many other teams' struggles.

Over time, I’ve developed a preoccupation with failure. This has led to the study of failure, which unexpectedly led to the study of success. There have been invaluable lessons here, but I want to discuss one on working with people.

On the value of engineering

Software development work's fundamental value is making users' lives easier. There are different ways to do this, but in my experience, the most remarkable deliverables result from some unique insight that understands the customer perspective, the product vision and the technology and combines them in some new and innovative way. This might mean understanding that, when buying clothes online, customers struggle to understand their sizes, but that recent gains in computer vision can be leveraged to make that easier. Or it might mean understanding that customers are happy to watch a subset of content instantly if they do not have to leave the house.

To make these larger capabilities available to customers, there still needs to be a lot of work done, but in each deliverable, there is often an insight that allows a lot of that work to be done more efficiently. Rather than deploying on Kubernetes, deploy on AppEngine. Rather than using Java, use Go or Python. Rather than using Redis as a document store, use DynamoDB. Small efficiencies lead some teams to substantially outperform others in delivering that value.

Insight comes from a few people with a large amount of context and the ability to combine that context in new and exciting ways — often in discussion with others. However, there can be a bit of a gap between how these colleagues gain and leverage their insight and how their leadership views the same work.

The decision-maker disconnect

Within any given organisation, there is frequently a set of people who hold substantially more power than others. They can be more formally allocated that power (for example, engineering management), or they can have it due to expertise or reputation (staff or site reliability engineering). Because of that power, these colleagues can either make or strongly influence decisions directly.

These colleagues all have some notion as to how the organisation works. They might have either explicitly prescribed it or might be imagining it. They use this understanding as a precursor on which to make decisions.

One of the more surprising things from the “human actors” research is just how disconnected a decision-maker’s model of how the org works is from how it actually works. Sometimes, the overlap between how the leader thinks the organisation gets work done and how it gets delivered is minimal! In practice, there are four ways to view “work”:

Work as imagined
Work as prescribed
Work as disclosed
Work as done

Only the last case is how work actually gets delivered.

Work as (not) disclosed

One of the challenges that decision makers or leaders within an organisation will face is that the power dynamic between a decision maker (a “high power” person) and the person tasked with executing that decision (a “low power” person) means that lower power colleagues can face a much broader and more significant set of adverse consequences if they challenge that high power person. This could be as simple as losing esteem in the eyes of the decision maker or as complex as being criticized by that decision maker for disagreeing with their perspective.

As colleagues on the execution side of decisions deal with a broader range of leaders, they tend to encounter leaders who are more punitive in their approach and thus start to tailor their information to minimise the chance of that leader being unhappy. The more extreme the power differential between the decision maker and the executor, the more likely an executor will tailor their information to benefit that decision maker.

For an organisation that is predicated on insight, this can be disastrous. At best, a colleague may not contribute their ideas on how to achieve a company goal. At worst, the colleague will contort themselves into agreeing with the decision maker. The colleague then views the work through a highly optimistic lens (“if everything goes right, I can make this work”), which, given that everything invariably doesn’t, leads to missed expectations and general unhappiness.

Safety in adversity

In his article “Why do things go right”, Sidney Dekker highlights several critical properties of organisations that are more successful:

Diversity of opinion and the ability to voice dissent
Keeping a discussion on risk alive
Deference to expertise
Ability to say stop
Broken down barriers between hierarchies and departments
Not waiting on audits or inspections to improve
Pride in workmanship

The challenge of any decision-maker (or anyone responsible for the management of people) is to try and figure out how to encourage diversity, dissent, discussions of risk, the ability to say stop and so on. The decision maker must first make the environment psychologically safe to promote these behaviours.

Psychological safety is:

the belief that you won’t be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes. At work, it’s a shared expectation held by members of a team that teammates will not embarrass, reject, or punish them for sharing ideas, taking risks, or soliciting feedback.
— Center for Creative Leadership

Leaders within an organisation influence psychological safety through their actions. Whether by encouraging discussion or punishing dissent, leaders set the tone for what is tolerable within their organisations. Correspondingly, this leads to those insightful engineering outcomes — or late projects.

A safe space

In his article “Psychological Safety in operations teams”, John Looney recommends concrete ways leaders can make their organisations psychologically safer. These include:

Creating space for people to take chances
Making it obvious when the team is doing well
Making your communication clear and your expectations explicit
Making your teams feel safe

In my experience, the most critical habit here is “listening with the intent to understand”. I’ll ask questions about how colleagues feel about their work in 1:1s, their primary challenges, and how they feel it fits into the strategic whole. I’ll listen, try and restate their perspective and ask them to confirm it before we proceed. Then, I’ll try to answer that question using their language and reconcile it against what I’ve seen and what other stakeholders might be considering. This gives them a much larger context to operate with and the ability to ask further clarifying questions.

In broader environments, we can set this expectation through example, asking deliberately naive questions so that others feel able to ask such questions too. We can ask for input from our colleagues, with the explicit direction that we’re asking them because of their perspective, even if they do not share (or even fully understand) what else is happening. We can listen and be kind in our interactions.

Lastly, there are environments in which more conversation is inherently permissible. Going for a walk, meeting at a conference or meetup, having a glass of wine after work or deliberately manufacturing a different context to discuss a topic can make it easier for people to take more risks.

By taking these actions, decision-makers can encourage their executors to take more chances when proposing, discussing or contributing to a group discussion. More discussion allows a greater diversity of opinion and dissent and the ability to recognise and surface expertise. This, in turn, will enable us to manufacture insight.

Safety with limits

The only caveat to an organisation that encourages dissent is that such an organisation needs controls around its decision-making processes to avoid getting stuck in analysis paralysis. In my experience, while decisions should be discussed freely, there comes a point when a leader needs to make and own a decision.

This is made a much smoother process if there is a way of retrospecting on previous decisions made, as well as learning whether or not there was a different decision that could be made in future. This allows dissenting colleagues to “disagree and commit” and for either that dissenting colleague or others within the organisation to learn from the results of previous decisions.

In Conclusion

My career so far has been varied but pockmarked with failure. Most recently, I’ve made failure my study of choice, and through this, I have encountered the analysis of success. This has led me to understand the value of engineering being fundamentally measured in customers' happiness, which is most easily found through some unique insight into an existing set of problems and technologies that can be applied in a new way.

This can be challenging to implement within organisations as decision-makers can become disconnected from those that execute their decisions. The power dynamic within an organisation means that unless they’re invested, those leaders might never discover that what they expect doesn’t hold up against reality, and that what actually happens is much more fraught with risk. Some organisations are routinely more successful, and those organisations prioritise discussions, have diverse opinions and allow dissent. To turn our organisations into these successful examples, we need to cultivate an environment of psychological safety that will enable colleagues to raise their diverse opinions or participate in discussions.

If we do this, we’ll have a much larger potential space to find our fundamentally elegant insights and a group of peers able to understand and execute them, delivering on that most valuable improved customer experience.

Woof! That one was hard to write. Find it useful? Great! Maybe someone else will. You can help them out by sharing it with them!

What is a container?

Andrew howden — Thu, 03 Aug 2023 09:17:39 GMT

Editor’s Note: I initially published this in 2019, but subsequently tore the property hosting it down. I wanted to share it with a colleague, so I’m reposting it. It might be slightly out of date.

Containers have recently become a common way of packaging, deploying and running software across various machines in various environments. With the initial release of Docker in March 2013^[¹^], containers have become ubiquitous in modern software deployment, with 71% of Fortune 100 companies running it in some capacity^[²^]. Containers can be used for:

Running user-facing, production software
Running a software development environment
Compiling software with its dependencies in a sandbox
Analysing the behaviour of software within a sandbox

Like their namesake in the shipping industry, containers are designed to easily "lift and shift" software to different environments and execute that software similarly across those environments.

Containers have thus earned their place in the modern software development toolkit. However, to understand how container technology fits into our modern software architecture, it’s worth understanding how we arrived at containers, as well as how they work.

History

The "birth" of containers was denoted by Bryan Cantrill as March 18th, 1982^[³^], with the addition of the chroot syscall in BSD. From the FreeBSD website^[⁴^]:

According to the SCCS logs, the chroot call was added by Bill Joy on March 18, 1982 approximately 1.5 years before 4.2BSD was released. That was well before we had ftp servers of any sort (ftp did not show up in the source tree until January 1983). My best guess as to its purpose was to allow Bill to chroot into the /4.2BSD build directory and build a system using only the files, include files, etc contained in that tree. That was the only use of chroot that I remember from the early days.

— Dr Marshall Kirk Mckusick

chroot is used to put a process into a "changed root", a new root filesystem with limited or no access to the parent root filesystem. An extremely minimal chroot can be created on Linux as follows^[⁵^]:

# Get a shell
$ cd $(mktemp -d)
$ mkdir bin
$ cp $(which sh) bin/bash

# Find shared libraries required for shell
$ ldd bin/bash
	linux-vdso.so.1 (0x00007ffe69784000)
	/lib/x86_64-linux-gnu/libsnoopy.so (0x00007f6cc4c33000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6cc4a42000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6cc4a21000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6cc4a1c000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6cc4c66000)

# Duplicate libraries into root
$ mkdir -p lib64 lib/x86_64-linux-gnu
$ cp /lib/x86_64-linux-gnu/libsnoopy.so \
    /lib/x86_64-linux-gnu/libc.so.6 \
    /lib/x86_64-linux-gnu/libpthread.so.0 \
    /lib/x86_64-linux-gnu/libdl.so.2 \
    lib/x86_64-linux-gnu/

$ cp /lib64/ld-linux-x86-64.so.2 lib64/

# Change into that root
$ sudo chroot .

# Test the chroot
# ls
/bin/bash: 1: ls: not found
#
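The manual copying above can be scripted. The following is a minimal sketch, assuming a Linux system with ldd available; build_chroot is a hypothetical helper name chosen for illustration, not part of any tool mentioned in this article. It copies a binary, plus every shared library path that ldd reports for it, into a new root directory:

```shell
# build_chroot ROOT BINARY
# Copy BINARY into ROOT/bin and its shared libraries into matching paths under ROOT.
build_chroot() {
    root=$1
    binary=$2

    mkdir -p "$root/bin"
    cp "$binary" "$root/bin/"

    # ldd prints one library per line; keep only the absolute paths it reports
    ldd "$binary" 2>/dev/null | grep -o '/[^ ]*' | while read -r lib; do
        mkdir -p "$root$(dirname "$lib")"
        cp "$lib" "$root$lib"
    done
}

# Example: build a root containing only a shell, then list what was copied
root=$(mktemp -d)
build_chroot "$root" "$(command -v sh)"
find "$root" -type f
```

Entering the resulting directory still requires sudo chroot, as shown above.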

There were problems with this early implementation of chroot, such as being able to exit the chroot by running cd ..^[³^], but these were resolved in short order. Seeking to provide better security, FreeBSD extended chroot into the jail^[³^][⁴^], which allowed software that expected to run as root to run within a confined environment: root within that environment, but not root elsewhere on the system.

This work was further built upon in the Solaris operating system to provide fuller isolation from the host^[³^][⁶^]:

User separation (similar to jail)
Filesystem separation (similar to chroot)
A separate process space

Together these provided something similar to the modern concept of containers: isolated processes running on the same kernel. Later, similar work took place in the Linux kernel to isolate kernel structures per process under "namespaces"^[⁷^].

However, in parallel, Amazon Web Services (AWS) launched their Elastic Compute Cloud (EC2) product which took a different approach to separate workloads: virtualising the entire hardware^[³^]. This has different tradeoffs; it limits the exploitation of the host kernel or isolation implementation; however, running the additional operating system and hypervisor meant far less efficient use of resources.

Virtualisation continued to dominate workload isolation until the company dotCloud (now Docker), then operating as a "platform as a service" (PaaS) offering, open-sourced the software they used to run their PaaS. With that software and much luck, containers proliferated rapidly until Docker became the powerhouse it is now.

Shortly after Docker released their container runtime, they expanded their product offerings into build, orchestration and server management tooling^[⁸^]. Unhappy with this, CoreOS created its own container runtime, rkt, which had the stated goal of interoperating with existing services, such as systemd, following the UNIX philosophy of "Write programs that do one thing and do it well"^[⁹^].

The Open Container Initiative was established to reconcile these disparate definitions of a container [10], after which Docker donated its schema and runtime as a de facto container standard.

There are now several container implementations and standards to define their behaviour.

Definition

It might be surprising to learn that a "container" is not a real thing but a specification. At the time of writing, this specification has implementations on^[11]:

Linux
Windows
Solaris
Virtual Machines

In turn, containers are expected to be^[¹²^]:

Consumable with a set of standard, interoperable tools
Consistent regardless of what type of software is being run
Agnostic to the underlying infrastructure the container is being run on
Designed in a way that makes automation easy
Of excellent quality

Specifications dictate how containers should reach these principles by defining how they should be executed (the runtime specification^[¹¹^]), what a container should contain (the image specification^[¹³^]) and how to distribute container "images" (the distribution specification^[¹⁴^]).

These specifications mean that various tools can be used to interact with containers. The canonical tool in most common use is the Docker tool, which, in addition to manipulating containers, provides container build tooling and some limited orchestration of containers. However, there are many other container runtimes, as well as other tools that help with building or distributing images.

Lastly, extensions to the existing standards, such as the container networking interface, define additional behaviour where the standards are not yet clear enough.

Implementation

While the standards give us some idea as to what a container is and how it should work, it’s perhaps helpful to understand how a container implementation works. Not all container runtimes are implemented this way; notably, Kata Containers uses hardware virtualisation, as mentioned earlier with EC2.

The problems being solved by containers are:

Isolation of a process(es)
Distribution of that process(es)
Connecting that process(es) to other machines

With that said, let’s dive into the Docker implementation^[¹⁵^]. This uses a series of technologies exposed by the underlying kernel:

Kernel feature isolation: namespaces

The man namespaces command defines namespaces as follows:

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.

Paraphrased, a namespace is a slice of the system; from within that slice, a process cannot see the rest of the system.

A process must make a system call to the Linux kernel to change its namespace. There are several system calls:

clone: Create a new process. When used in conjunction with CLONE_NEW* it creates a namespace of the kind specified. For example, if used with CLONE_NEWPID the process will enter a new pid namespace and become pid 1
setns: Allows the calling process to join an existing namespace specified under /proc/[pid]/ns
unshare: Moves the calling process into a new namespace

There is a user command also called unshare which allows us to experiment with namespaces. We can put ourselves into a separate process and network namespace with the following command:

# Scratch space
$ cd $(mktemp -d)

# Fork is required to spawn new processes, and proc is mounted to give accurate process information
$ sudo unshare \
    --fork \
    --pid \
    --mount-proc \
    --net

# Here we see that we only have access to the loopback interface
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

# Here we see that we can only see the first process (bash) and our `ps aux` invocation
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.3  0.0   8304  5092 pts/7    S    05:48   0:00 -bash
root         5  0.0  0.0  10888  3248 pts/7    R+   05:49   0:00 ps aux

Docker uses the following namespaces to limit the ability for a process running in the container to see resources outside that container:

The pid namespace: Process isolation (PID: Process ID).
The net namespace: Managing network interfaces (NET: Networking).
The ipc namespace: Managing access to IPC resources (IPC: InterProcess Communication).
The mnt namespace: Managing filesystem mount points (MNT: Mount).
The uts namespace: Isolating kernel and version identifiers. (UTS: Unix Timesharing System).
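The namespaces a process belongs to can be inspected without any special privileges, because the kernel exposes them as symbolic links under /proc/[pid]/ns. The following is a small sketch, assuming a Linux system with /proc mounted; show_namespaces is a hypothetical helper name. Two processes share a namespace when the identifiers printed for them are the same:

```shell
# show_namespaces [PID]
# Print the namespace identifiers for PID (default: the current shell).
show_namespaces() {
    pid=${1:-$$}
    for ns in pid net ipc mnt uts; do
        # Each link target looks like "net:[<inode number>]"
        printf '%s\t%s\n' "$ns" "$(readlink "/proc/$pid/ns/$ns")"
    done
}

show_namespaces
```

Running it once on the host and once inside the unshare shell from earlier should show different identifiers for the pid and net namespaces, and identical ones for the namespaces that were not unshared.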

These provide reasonable separation between processes such that workloads should not be able to interfere with each other. However, there is a notable caveat: we can disable some of this isolation^[16].

This is an extremely useful property. One example would be for system daemons needing access to the host network to bind ports on the host^[¹⁷^], such as running a DNS service or service proxy in a container.

TIP: Process #1 or the init process in Linux systems has some additional responsibilities. When processes terminate in Linux they are not automatically cleaned up, but rather simply enter a terminated state. It is the responsibility of the init process to "reap" those processes, deleting them so that their process ID can be reused^[¹⁸^]. Accordingly the first process run in a Linux namespace should be an init process, and not a user facing process like mysql. This is known as the zombie reaping problem.

Resource isolation: control groups

The kernel documentation for cgroups defines a cgroup as follows:

Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.

That doesn’t really tell us much, though. Luckily it expands:

On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup.

So, cgroups are groups of "jobs" that other systems can assign meaning to. Several kernel subsystems currently hook into cgroups in this way, including cpusets (mentioned in the quote above) and the memory controller, as well as various others.

cgroups are manipulated by reading and writing to the cgroup filesystem, usually mounted under /sys/fs/cgroup (the paths below are those of the cgroup v1 hierarchy). For example:

# Create a cgroup called "me"
$ sudo mkdir /sys/fs/cgroup/memory/me

# Allocate the cgroup a max of 100MB memory
$ echo '100000000' | sudo tee /sys/fs/cgroup/memory/me/memory.limit_in_bytes

# Move this process into the cgroup
$ echo $$ | sudo tee /sys/fs/cgroup/memory/me/cgroup.procs
5924

That’s it! This process should now be limited to 100MB of memory usage in total.

Docker uses the same functionality in its --memory and --cpus arguments, and it is employed by the orchestration systems Kubernetes and Apache Mesos to determine where to schedule workloads.
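To check which cgroups a process has ended up in, the kernel lists them in /proc/[pid]/cgroup, one line per hierarchy in the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal sketch for reading that file; cgroup_paths is a hypothetical helper name, and labelling the empty controller list of a cgroup v2 entry as "unified" is a choice made here for readability:

```shell
# cgroup_paths [FILE]
# Print "controllers path" for each line of a /proc/[pid]/cgroup style file.
cgroup_paths() {
    file=${1:-/proc/self/cgroup}
    while IFS=: read -r id controllers path; do
        # cgroup v2 entries have an empty controller list ("0::/path")
        printf '%s %s\n' "${controllers:-unified}" "$path"
    done < "$file"
}

# Show the cgroups of the current shell, if the file is available
if [ -r /proc/self/cgroup ]; then
    cgroup_paths
fi
```

After the steps above, a cgroup v1 system would be expected to list the memory controller with the path /me for that shell.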

TIP

Although cgroups are most commonly associated with containers, they are already used for other workloads. The best example is perhaps systemd, which automatically puts all services into a cgroup if the CPU scheduler is enabled in the kernel^[²⁰^]. systemd services are … kind of containers!

Userland isolation: seccomp

While both namespaces and cgroups go a significant way towards isolating processes into their containers, Docker goes further and restricts what access the process can have to the Linux kernel itself. This is enforced in supported operating systems via "SECure COMPuting with filters", also known as seccomp-bpf or simply seccomp.

The Linux kernel user space API guide defines seccomp as:

Seccomp filtering provides a means for a process to specify a filter for incoming system calls. The filter is expressed as a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: system call number and the system call arguments.

BPF, in turn, is a small, in-kernel virtual machine language used in several kernel tracing, networking and other tasks^[²¹^]. Whether the system supports seccomp can be determined by running the following command^[²²^]:

$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)

# Our system supports seccomp
CONFIG_SECCOMP=y

Practically, this limits a process’s ability to ask the kernel to do certain things. Any system call can be restricted, and Docker allows the use of arbitrary seccomp "profiles" via its --security-opt argument^[²²^]:

docker run --rm \
  -it \
  --security-opt seccomp=/path/to/seccomp/profile.json \
  hello-world

However, most usefully, Docker provides a default security profile that limits some of the more dangerous system calls that processes run from a container should never need to make, including:

clone: The ability to clone new namespaces
bpf: The ability to load and run bpf programs
add_key: The ability to access the kernel keyring
kexec_load: The ability to load a new Linux kernel

As well as many others. The full list of syscalls blocked by default is available on the Docker website.
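A custom profile is a JSON document. The following is an illustrative sketch only, assuming the profile layout described in the Docker seccomp documentation^[²²^] (a defaultAction plus a list of syscalls rules); write_seccomp_profile is a hypothetical helper name. Unlike Docker's default profile, which denies by default and allows a list of calls, this sketch allows everything except two of the calls listed above:

```shell
# write_seccomp_profile FILE
# Write a small deny-list seccomp profile to FILE.
write_seccomp_profile() {
    cat > "$1" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["kexec_load", "add_key"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
EOF
}

profile=$(mktemp)
write_seccomp_profile "$profile"
cat "$profile"
```

The resulting file can then be passed to the --security-opt seccomp= argument shown earlier. For anything beyond experimentation, starting from Docker's default profile and removing entries is likely the safer route.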

In addition to seccomp there are other ways to ensure containers are behaving as expected, including:

Linux Capabilities^[²³^]
SELinux
AppArmor
auditd
Falco^[²⁴^]

Each of these takes a slightly different approach to ensuring the process only behaves as expected. It’s worth spending time investigating the tradeoffs of each security decision or simply delegating the choice to a competent third-party provider.

Additionally, it’s worth noting that even though Docker defaults to enabling the seccomp policy, orchestration systems such as Kubernetes may disable it^[²⁵^].

Distribution: the union file system

To generate a container image, Docker requires a set of "build instructions". A trivial image could be:

# Scratch space
$ cd $(mktemp -d)

# Create a docker file
$ cat <<EOF > Dockerfile
FROM debian:buster

# Create a test directory
RUN mkdir /test

# Create a bunch of spam files
RUN echo $(date) > /test/a
RUN echo $(date) > /test/b
RUN echo $(date) > /test/c

EOF

# Build the image
$ docker build .
Sending build context to Docker daemon  4.096kB
Step 1/5 : FROM debian:buster
 ---> ebdc13caae1e
Step 2/5 : RUN mkdir /test
 ---> Running in a9c0fa1a56c7
Removing intermediate container a9c0fa1a56c7
 ---> 6837541a46a5
Step 3/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/a
 ---> Running in 8b61ca022296
Removing intermediate container 8b61ca022296
 ---> 3ea076dcea98
Step 4/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/b
 ---> Running in 940d5bcaa715
Removing intermediate container 940d5bcaa715
 ---> 07b2f7a4dff8
Step 5/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/c
 ---> Running in 251f5d00b55f
Removing intermediate container 251f5d00b55f
 ---> 0122a70ad0a3
Successfully built 0122a70ad0a3

This creates a docker image with the id of 0122a70ad0a3 containing the contents of date at a, b and c. We can verify this by starting the container and examining its contents:

$ docker run \
  --rm=true \
  -it \
  0122a70ad0a3 \
  /bin/bash

$ cd /test
$ ls
a  b  c
$ cat *

Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019

However, in the docker build command earlier, Docker created several images. If we run the intermediate image created after only a and b had been written, we will not see c:

$ docker run \
  --rm=true \
  -it \
  07b2f7a4dff8 \
  /bin/bash
$ ls test
a  b

Docker is not creating a whole new filesystem for each of these images. Instead, each of the images is layered on top of each other. If we query Docker, we can see each of the layers that go into a given image:

$ docker history 0122a70ad0a3
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
0122a70ad0a3        5 minutes ago       /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019…   29B
07b2f7a4dff8        5 minutes ago       /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019…   29B
3ea076dcea98        5 minutes ago       /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019…   29B
6837541a46a5        5 minutes ago       /bin/sh -c mkdir /test                          0B
ebdc13caae1e        12 months ago       /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>           12 months ago       /bin/sh -c #(nop) ADD file:2219cecc89ed69975…   106MB

This allows docker to reuse vast chunks of what it downloads. For example, given the image we built earlier, we can see that it uses:

A layer called ADD file:… — this is the Debian Buster root filesystem at 106MB
A layer for a that renders the data to disk at 29B
A layer for b that renders the data to disk at 29B

And so on. Docker will reuse the ADD file:… Debian Buster root for all images that start with FROM debian:buster.

This allows Docker to be highly space efficient, reusing the same operating system image for multiple executions.

TIP

Even though Docker is hugely space efficient, the docker library on disk can grow extremely large, and transferring large docker images over the network can become expensive. Therefore, try to reuse image layers where possible and prefer smaller operating systems or the scratch (nothing) image.

These layers are implemented via a Union Filesystem or UnionFS. There are various "backends" or filesystems that can implement this approach:

overlay2
devicemapper
aufs

Generally speaking, the package manager on our machine will include the appropriate underlying filesystem driver; docker supports many:

$ docker info | grep Storage
Storage Driver: overlay2

We can replicate this implementation with our own overlay mount fairly easily^[²⁶^]:

# Scratch space
$ cd $(mktemp -d)

# Create some layers
$ mkdir \
  lower \
  upper \
  workdir \
  overlay

# Create some files that represent the layers
$ touch lower/i-am-the-lower
$ touch upper/i-am-the-upper

# Create the layered filesystem at overlay with lower, upper and workdir
$ sudo mount -t overlay \
    -o lowerdir=lower,upperdir=upper,workdir=workdir \
    ./overlay \
    overlay

# List the directory
$ ls overlay/
i-am-the-lower  i-am-the-upper

Docker goes so far as to nest those layers until the multi-layered filesystem has been successfully implemented.

Files that are written are written back to the upper directory in the case of overlay2. However, Docker will generally dispose of these temporary files when the container is removed.
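The lookup order of such a layered filesystem can be illustrated without mounting anything. The following sketch is only a model of the behaviour, not how the kernel implements overlay (it ignores deletions and directory merging, for instance); overlay_lookup is a hypothetical helper name. Layers are searched from the topmost down, and the first layer that contains the file wins:

```shell
# overlay_lookup NAME LAYER...
# Print the path of NAME in the first (topmost) layer that contains it.
overlay_lookup() {
    name=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$name" ]; then
            printf '%s\n' "$layer/$name"
            return 0
        fi
    done
    return 1
}

# Example: "shared" exists in both layers, so the upper copy is the one seen
demo=$(mktemp -d)
mkdir "$demo/upper" "$demo/lower"
touch "$demo/lower/i-am-the-lower" "$demo/lower/shared" "$demo/upper/shared"
overlay_lookup shared "$demo/upper" "$demo/lower"
overlay_lookup i-am-the-lower "$demo/upper" "$demo/lower"
```

This is why a file changed in a later image layer hides the version from an earlier layer without the earlier layer being modified.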

TIP

Generally speaking, all software needs access to shared libraries found in static paths in Linux operating systems. Accordingly, it is the convention to simply ship a stripped-down version of an operating system’s root file system such that users can install it and applications can find the libraries they expect. However, it is possible to use an empty filesystem and a statically compiled binary with the scratch image type.

Connectivity: networking

As mentioned earlier, containers make use of Linux namespaces. Of particular interest when understanding container networking is the network namespace. This namespace gives the process separate:

(Virtual) ethernet devices
routing tables
iptables rules

For example,

# Create a new network namespace
$ sudo unshare --fork --net

# List the ethernet devices with associated ip addresses
$ ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

# List all iptables rules
root@sw-20160616-01:/home/andrewhowden# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

# List all network routes
$ ip route show

By default, the container has no network connectivity — not even the loopback adapter is up. We cannot even ping ourselves!

$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
ping: sending packet: Network is unreachable

We can start setting up the expected network environment by bringing up the loopback adapter:

$ ip link set lo up
root@sw-20160616-01:/home/andrewhowden# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

# Test the loopback adapter
$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.092 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.068 ms

However, we cannot access the outside world. In most environments, our host machine will be connected via ethernet to a given network and either have an IP assigned to it via the cloud provider or, in the case of a development or office machine, request an IP via DHCP. However, our container is in a network namespace of its own and does not know about the ethernet device connected to the host. We need to employ a veth device to connect the container to the host.

veth, or "Virtual Ethernet Device", is defined by man veth as:

The veth devices are virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices.

This is precisely what we need! Because unshare creates an anonymous network namespace, we need to determine what the pid of the process started in that namespace is^[²⁷^]:

$ echo $$
18171

We can then create the veth device:

$ sudo ip link add veth0 type veth peer name veth0 netns 18171

We can see these virtual ethernet devices appear on both the host and the guest. However, neither has an IP attached nor any routes defined:

# Container

$ ip addr
1: lo:  mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth0@if7:  mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 16:34:52:54:a2:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
$ ip route show

# No output

To address that, we add an IP and define the default route:

# On the host
$ ip addr add 192.168.24.1 dev veth0

# Within the container
$ ip address add 192.168.24.10 dev veth0

From there, bring the devices up:

# Both host and container
$ ip link set veth0 up

Add a route such that 192.168.24.0/24 goes out via veth0:

# Both host and container
$ ip route add 192.168.24.0/24 dev veth0

And voilà! We have connectivity to the host namespace and back:

# Within container
$ ping 192.168.24.1
PING 192.168.24.1 (192.168.24.1): 56 data bytes
64 bytes from 192.168.24.1: icmp_seq=0 ttl=64 time=0.149 ms
64 bytes from 192.168.24.1: icmp_seq=1 ttl=64 time=0.096 ms
64 bytes from 192.168.24.1: icmp_seq=2 ttl=64 time=0.104 ms
64 bytes from 192.168.24.1: icmp_seq=3 ttl=64 time=0.100 ms

However, that does not give us access to the wider internet. While the veth adapter functions as a virtual cable between our container and our host, there is currently no path from our container to the internet:

# Within container
$ ping google.com
ping: unknown host

To create such a path we need to modify our host such that it functions as a "router" between its own, separated network namespaces and its internet facing adapter.

Luckily, Linux is set up well for this purpose. First, we need to change the default behaviour of Linux, which is to drop packets that are not destined for one of the host's own IP addresses, so that it instead forwards packets from one adapter to another:

# On the host
$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward

That means when we request public facing IPs from within our container via our veth adapter to our host veth adapter the host adapter won’t simply drop those packets.

From there we employ iptables rules on the host to forward traffic from the host veth adapter to the internet facing adapter — in this case wlp2s0:

# On the host
# Forward packets from the container to the host adapter
iptables -A FORWARD -i veth0 -o wlp2s0 -j ACCEPT

# Forward packets for connections established via egress from the host adapter back to the container
iptables -A FORWARD -i wlp2s0 -o veth0 -m state --state ESTABLISHED,RELATED -j ACCEPT

# Relabel the IPs for the container so return traffic will be routed correctly
iptables -t nat -A POSTROUTING -o wlp2s0 -j MASQUERADE

We then tell our container to send traffic it doesn’t know anything else about down the veth adapter:

# Within the container
$ ip route add default via 192.168.24.1 dev veth0

And the internet works!

$ ping google.com
PING google.com (172.217.22.14): 56 data bytes
64 bytes from 172.217.22.14: icmp_seq=0 ttl=55 time=16.456 ms
64 bytes from 172.217.22.14: icmp_seq=1 ttl=55 time=15.102 ms
64 bytes from 172.217.22.14: icmp_seq=2 ttl=55 time=34.369 ms
64 bytes from 172.217.22.14: icmp_seq=3 ttl=55 time=15.319 ms
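The host-side steps above can be gathered into a single script. The following is a sketch only, reusing the addresses, the veth0 name and the wlp2s0 adapter from the examples above; run and connect_namespace are hypothetical helper names, and the DRY_RUN switch is an assumption added here so that the commands can be reviewed before anything is changed. By default it only prints what it would do:

```shell
# run COMMAND...
# Print the command when DRY_RUN=1 (the default here), otherwise run it with sudo.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        sudo "$@"
    fi
}

# connect_namespace PID UPLINK
# PID: a process inside the target network namespace
# UPLINK: the internet facing adapter on the host
connect_namespace() {
    pid=$1
    uplink=$2

    run ip link add veth0 type veth peer name veth0 netns "$pid"
    run ip addr add 192.168.24.1 dev veth0
    run ip link set veth0 up
    run ip route add 192.168.24.0/24 dev veth0
    run sysctl -w net.ipv4.ip_forward=1
    run iptables -A FORWARD -i veth0 -o "$uplink" -j ACCEPT
    run iptables -A FORWARD -i "$uplink" -o veth0 -m state --state ESTABLISHED,RELATED -j ACCEPT
    run iptables -t nat -A POSTROUTING -o "$uplink" -j MASQUERADE
}

connect_namespace 18171 wlp2s0
```

The commands inside the container (assigning 192.168.24.10, bringing veth0 up and adding the routes) still need to be run from within the namespace itself.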

As mentioned, each container implementation can implement networking differently. There are implementations that use the aforementioned veth pair, vxlan, BPF or other cloud specific implementations. However, when designing containers we need some way to reason about what behaviour we should expect.

To help address this, the "Container Network Interface" tooling has been designed. This allows defining consistent network behaviour across network implementations, as well as models such as Kubernetes' shared lo adapter between several containers.

The networking side of containers is an area undergoing rapid innovation but relying on:

A lo interface
A public facing eth0 (or similar) interface

being present seems a fairly stable guarantee.

Landscape review

Given our understanding of the implementation of containers we can now take a look at some of the classic docker discussions.

Systems Updates

One of the oft-overlooked parts of containers is the necessity to keep them and the host system up to date.

In modern systems, it is pretty common to enable automatic updates on host systems, and so long as we stick to the system package manager and ensure updates stay successful, the system will keep itself both up-to-date and stable.

However, containers take a very different approach. They’re effectively giant static binaries deployed into a production system. In this capacity, they can do no self-maintenance.

Accordingly, even if there are no updates to the container's software, containers should be periodically rebuilt and redeployed to the production system, lest they accumulate vulnerabilities over time.

Init within container

Given our understanding of containers, it’s reasonable to consider the "1 process per container" advice and determine that it is an oversimplification of how containers work, and that it makes sense in some cases to do service management within a container with a system like runit.

This allows multiple processes to be executed within a single container including things like:

syslog
logrotate
cron

And so forth.

In the case where Docker is the only system that is being used, it is indeed reasonable to think about doing service management within Docker, particularly when hitting the constraints of shared filesystem or network state. However, systems such as Kubernetes, Swarm or Mesos have replaced much of the necessity of these init systems; tasks such as log aggregation, restarting services or colocating services are taken care of by these tools.

Accordingly, it’s best to keep containers simple such that they are maximally composable and easy to debug, delegating the more complex behaviour out.

In Conclusion

Containers are an excellent way to ship software to production systems. They solve a swathe of interesting problems and cost very little as a result. However, their rapid growth has meant some confusion in the industry as to exactly how they work, whether they’re stable and so forth. Containers are a combination of both old and new Linux kernel technology such as namespaces, cgroups, seccomp and other Linux networking tooling, but they are as stable as any other kernel technology (so, very) and well suited for production systems.

<3 for making it this far.

References

[1] “Docker.” https://en.wikipedia.org/wiki/Docker_(software) .

[2] “Cloud Native Technologies in the Fortune 100.” https://redmonk.com/fryan/2017/09/10/cloud-native-technologies-in-the-fortune-100/ , Sep. 2017.

[3] B. Cantrill, “The Container Revolution: Reflections After the First Decade.” , Sep. 2018.

[4] “Papers (Jail).” https://docs.freebsd.org/44doc/papers/jail/jail.html .

[5] “An absolutely minimal chroot.” https://sagar.se/an-absolutely-minimal-chroot.html , Jan. 2011.

[6] J. Beck et al., “Virtualization and Namespace Isolation in the Solaris Operating System (PSARC/2002/174).” https://us-east.manta.joyent.com/jmc/public/opensolaris/ARChive/PSARC/2002/174/zones-design.spec.opensolaris.pdf , Sep. 2006.

[7] M. Kerrisk, “Namespaces in operation, part 1: namespaces overview.” https://lwn.net/Articles/531114/ , Jan. 2013.

[8] A. Polvi, “CoreOS is building a container runtime, rkt.” https://coreos.com/blog/rocket.html , Jan. 2014.

[9] “Basics of the Unix Philosophy.” http://www.catb.org/~esr/writings/taoup/html/ch01s06.html .

[10] P. Estes and M. Brown, “OCI Image Support Comes to Open Source Docker Registry.” https://www.opencontainers.org/blog/2018/10/11/oci-image-support-comes-to-open-source-docker-registry , Oct. 2018.

[11] “Open Container Initiative Runtime Specification.” https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/spec.md , Mar. 2018.

[12] “The 5 principles of Standard Containers.” https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/principles.md , Dec. 2016.

[13] “Open Container Initiative Image Specification.” https://github.com/opencontainers/image-spec/blob/db4d6de99a2adf83a672147d5f05a2e039e68ab6/spec.md , Jun. 2017.

[14] “Open Container Initiative Distribution Specification.” https://github.com/opencontainers/distribution-spec/blob/d93cfa52800990932d24f86fd233070ad9adc5e0/spec.md , Mar. 2019.

[15] “Docker Overview.” https://docs.docker.com/engine/docker-overview/ .

[16] J. Frazelle, “Containers aka crazy user space fun.” , Jan. 2018.

[17] “Use Host Networking.” https://docs.docker.com/network/host/ .

[18] Krallin, “Tini: A tini but valid init for containers.” https://github.com/krallin/tini , Nov. 2018.

[19] https://chromium.googlesource.com/chromium/src.git/+/HEAD/docs/linux_sandboxing.md .

[20] L. Poettering, “systemd for Administrators, Part XVIII.” http://0pointer.de/blog/projects/resources.html , Oct. 2012.

[21] A. Howden, “Coming to grips with eBPF.” https://www.littleman.co/articles/coming-to-grips-with-ebpf/ , Mar. 2019.

[22] “Seccomp security profiles for docker.” https://docs.docker.com/engine/security/seccomp/ .

[23] “Linux kernel capabilities.” https://docs.docker.com/engine/security/security/#linux-kernel-capabilities .

[24] M. Stemm, “SELinux, Seccomp, Sysdig Falco, and you: A technical discussion.” https://sysdig.com/blog/selinux-seccomp-falco-technical-discussion/ , Dec. 2016.

[25] “Pod Security Policies.” https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp .

[26] Programster, “Example OverlayFS Usage.” https://askubuntu.com/a/704358 , Nov. 2015.

[27] “How do I connect a veth device inside an ’anonymous’ network namespace to one outside?” https://unix.stackexchange.com/a/396210 , Oct. 2017.

[28] D. P. García, “Network namespaces.” https://blogs.igalia.com/dpino/2016/04/10/network-namespaces/ , Apr. 2016.

What is an SLO? What is it suitable for?

Andrew howden — Tue, 01 Aug 2023 06:30:19 GMT

Recently, I’ve had the opportunity to help different organisations implement service-level objectives. The experience has been great, and the organisations are much better due to these clear boundaries (at least). But through this, a familiar series of questions or challenges has come up for each organisation. Today, I want to talk you through what a service-level objective is and what it is suitable for. Later, I’ll also publish some guidance on how you can practically implement a service-level objective in your organisation.

The road so far

Most of my experience with service-level objectives (or “SLOs”) has been at Zalando. Zalando started with SLOs in 2016 and has been on an evolving path to try and improve their effectiveness. You can read more about that journey in an excellent article by my former colleague Pedro Alves on the Zalando Engineering Blog. The journey hasn’t been smooth, and the organisation's handling of SLOs requires patience and enablement with engineers, managers, product colleagues and executives. You can learn more about how we use these in practice by watching the SLOConf video.

This article is designed to boil down those seven years of experience into something you can practically leverage within your organisations.

The fundamental problem

When leveraged collectively, engineers can produce software that can fundamentally change the economics of a business, dramatically reducing the cost of a required business operation and opening new markets! However, engineers are some of the more expensive colleagues to hire, and the talent pool for engineers is limited. Additionally, for any given software, there’s a hard limit on how many engineers it's possible to add before the amount of work required to coordinate those engineers outweighs the additional capacity a single engineer adds.

Those of us responsible for the engineering time of ourselves and others agonise over the question:

How do I spend my engineers' time to benefit the organisation as much as possible?

Even if we have engineering time, an engineer's work is more than adding new, exciting user features to our product. They also need to ensure the software:

Continues to work as expected for users, especially when user demand changes or new features are released.
Remains secure as new vulnerabilities are discovered.
Is sufficiently clear and straightforward so that new colleagues can be onboarded and contribute effectively.
Runs efficiently, so the business pays little for the underlying computing.

Different organisations need to make different tradeoffs. A startup might not care as much about reliability or cost as it does about implementing the significant new features required to onboard a new client. The question is then how to split our engineering capacity amongst these tasks. We could choose it based on recent input, managerial decision or gut feeling, but there’s a better way — through data.

Measuring what matters

To establish a framework to govern engineering time, we need to be able to measure the value we get out of different kinds of engineering work. The deciding factor for where we spend engineering time is a critical business metric — financial return on investment, reduced risk of financial loss, customer lifetime value and engagement. To switch to reliability work, we need to be able to demonstrate what our current level of reliability is, as well as what the likelihood and business cost of unavailability will be.

The best way to do this is to try and measure the availability of the customer experience as close to the customer as possible. This could be at the edge of your software system (e.g. API Gateway) if you do not own the client device app, or directly in that device if you do. Try to split your measurement up into segments that are meaningful to the business. We frequently use customer operations such as “add to cart” or “place order”. Each of these measurements becomes what we call a service level indicator.
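
As a rough sketch of what a per-operation service level indicator could look like, the Python below computes the share of successful requests for each customer operation. The record shape, the operation names and the rule that any 5xx response counts as a failure are all assumptions made for illustration; your own definition of a "good" request may well differ.

```python
from collections import defaultdict

# Hypothetical request records captured at the edge: (operation, HTTP status).
requests = [
    ("add_to_cart", 200),
    ("add_to_cart", 200),
    ("add_to_cart", 503),
    ("place_order", 200),
    ("place_order", 200),
    ("place_order", 200),
    ("place_order", 500),
    ("place_order", 200),
]


def availability_by_operation(records):
    """Return the share of successful requests per operation (the SLI)."""
    good = defaultdict(int)
    total = defaultdict(int)
    for operation, status in records:
        total[operation] += 1
        # Assumption: any 5xx response counts against availability.
        if status < 500:
            good[operation] += 1
    return {op: good[op] / total[op] for op in total}


slis = availability_by_operation(requests)
for operation, sli in sorted(slis.items()):
    print(f"{operation}: {sli:.1%}")
```

Segmenting by operation rather than by service keeps the number tied to something the business already cares about.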

Invariably, you’ll spot periods in which your reliability drops. Take a look at what happens to other critical customer metrics both during and after these periods, and see how much you’re losing as a consequence of unreliability. Keep track of it over 60 days (for example), and write a report for management, making it clear what the cost of unavailability was and what it would have been had we maintained other levels of reliability.

Setting the target

Once you’re armed with data, the next step is to work with management to make this data actionable. There are two things that you establish here:

The Target

The first thing to do is to recommend a specific target that, based on your data, is achievable without sacrificing too much product development (or other critical deliverables) but that will positively impact the customer experience and core business metrics. This target is the actual service level objective. The hint is in the name! It’s our objective.

It’s often easiest to pick a number similar to the median historic availability: just be as available as you are “normally”. If you’ve suffered some major issues during that period, the mean availability will be lower than the median, so a target set at the median already means that you need to spend more time on reliability work. It’s often much easier to sell management on continuing to achieve what you already do on a typical day and just fixing “those outliers”, and from a management perspective, it limits the maximal investment required.
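
To make the difference between the median and the mean concrete, here is a small Python sketch. The daily availability figures are invented for illustration: most days are "normal" and two days contain major incidents.

```python
from statistics import mean, median

# Hypothetical daily availability (%) for one operation over two weeks.
# Two days contain major incidents; the rest are "normal" days.
daily_availability = [
    99.9, 99.8, 99.9, 99.7, 99.9, 92.0, 99.8,
    99.9, 99.9, 85.0, 99.8, 99.9, 99.7, 99.9,
]

typical = median(daily_availability)  # what a "normal" day looks like
actual = mean(daily_availability)     # dragged down by the two outliers

print(f"median: {typical:.2f}%")
print(f"mean:   {actual:.2f}%")
print(f"gap to close: {typical - actual:.2f} percentage points")
```

The gap between the two numbers is a rough measure of how much reliability work a median-based target implies.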

An SLO is usually expressed over a period of time. For example:

“Place order” will have an availability of 99.5% on average over 28 days.

The culture

The second and more important thing to do is to establish a routine of reviewing the service level objectives with your management or engineering planning stakeholders and using it to prioritise work to improve reliability. After all, that’s the heart of the challenge we’re looking to solve and the very purpose we designed the SLO for! It’s also the part I frequently see engineering stakeholders skip.

The error budget is the best metaphor for understanding when to prioritise reliability over other work. It is essentially “1 minus the SLO of the service”. For example, if we have an SLO of 99.5% on our operation, we say we have a “0.5%” error budget. We fully expect to spend that error budget each month! We might introduce these errors during migrations, unexpected failures, bad deployments or any other issues that carry some technical risk.

So long as we remain within that 0.5% error budget, we do not need to add more capacity to reliability work. If we exceed that 0.5% error budget, we shift our engineering allocations to prioritize reliability work until the SLO returns to where we expect.
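
The arithmetic behind the error budget is small enough to sketch in Python. The request counts below are hypothetical, and the function names are mine rather than part of any tool; the policy function simply encodes the rule described above.

```python
def error_budget(slo):
    """The error budget is 1 minus the SLO (both as fractions)."""
    return 1.0 - slo


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative when overspent."""
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_prioritise_reliability(slo, total_requests, failed_requests):
    """The policy from the text: shift capacity once the budget is exceeded."""
    return budget_remaining(slo, total_requests, failed_requests) < 0


SLO = 0.995
total = 2_000_000   # hypothetical requests in the 28-day period
failed = 7_500      # hypothetical failed requests

print(f"error budget: {error_budget(SLO):.1%}")
print(f"budget remaining: {budget_remaining(SLO, total, failed):.0%}")
print("prioritise reliability:", should_prioritise_reliability(SLO, total, failed))
```

This version counts failed requests; a time-based budget (minutes of unavailability) works the same way with minutes in place of requests.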

We need our management stakeholders to agree and buy into this direction. Ultimately, they are the ones who are accountable for and drive our work, so they need to understand the value of this approach and build a culture where it is respected.

Taking Action

An SLO’s error budget can be quickly exhausted by periods of complete unavailability (a “fast burn” mode) as well as an issue that “slowly leaks” unavailability (a “slow burn” mode). They need different controls to take action.

Fast Burn

For fast burn modes, the best thing to do is to set alerts on the burn rate of the error budget and forward those to an on-call team member to take action when they’re received. Google has already presented excellent work on the math of doing this efficiently; the key takeaway for this article is that you should treat it as an emergency or “incident” and prioritise fixing it above all other work.
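
A minimal sketch of a burn-rate check, in Python. It follows the commonly described approach of comparing the observed error rate with the budget, as I understand it from Google's SRE material. The 2% per hour paging policy and the observed error rate are assumptions chosen for illustration, not recommendations.

```python
def burn_rate(error_rate, slo):
    """How many times faster than 'exactly on budget' we are burning."""
    return error_rate / (1.0 - slo)


def alert_threshold(budget_fraction, window_hours, period_hours):
    """Burn rate that spends `budget_fraction` of the budget in one window."""
    return budget_fraction * period_hours / window_hours


SLO = 0.995
PERIOD_HOURS = 28 * 24

# Assumed policy: page if one hour would burn 2% of the 28-day budget.
threshold = alert_threshold(0.02, window_hours=1, period_hours=PERIOD_HOURS)

observed_error_rate = 0.10  # hypothetical: 10% of requests failing this hour
rate = burn_rate(observed_error_rate, SLO)

print(f"burn rate {rate:.1f}, threshold {threshold:.2f}")
if rate > threshold:
    print("page the on-call engineer")
```

A burn rate of 1 means the budget would be spent exactly at the end of the period; anything far above that is the "fast burn" case.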

Slow Burn

For slow burn modes, there is far less urgency to the response. A slow burn issue will still exhaust the error budget, but it will take many days (or even weeks!) to do so. It’s most frequently introduced by a deployment that an engineer didn’t pay as much attention to as they should have.

To catch these, we can either rely on the same math Google provides but with a much longer window or periodically review a projection of the SLO at the end of the period. If the SLO is projected to go outside its budget, we take action. If not, we continue with the same trade-offs we’re making now.
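
The periodic projection can be as simple as a straight-line extrapolation. The Python sketch below assumes failures keep arriving at the rate seen so far in the period; the numbers are hypothetical, and a real projection might weight recent days more heavily.

```python
def projected_budget_spend(failed_so_far, days_elapsed, period_days,
                           expected_requests, slo):
    """Linearly project failures to the end of the period.

    Returns the projected share of the error budget spent (1.0 = all of it).
    """
    projected_failures = failed_so_far / days_elapsed * period_days
    allowed_failures = (1.0 - slo) * expected_requests
    return projected_failures / allowed_failures


spend = projected_budget_spend(
    failed_so_far=4_500,          # hypothetical failures after 10 days
    days_elapsed=10,
    period_days=28,
    expected_requests=2_000_000,  # hypothetical traffic for the full period
    slo=0.995,
)
print(f"projected budget spend: {spend:.0%}")
if spend > 1.0:
    print("projected to exceed the budget: schedule reliability work")
```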

In Conclusion

Determining where to spend engineering time is at the heart of any modern business, especially an internet-based one. Service level objectives (or SLOs) are invaluable tools for helping us decide where we spend that time clearly and objectively, guided by the actual value of reliability for our business. We need to work hard to make sure we measure what is critical for the customer experience and, correspondingly, for the business, rather than what is easy. We also need to work to establish the value of reliability by establishing the loss in its absence. If we do this, we can work with management to set a culture where this is prioritized and set routines both for fast and slow reliability challenges.

Or, we can get much more done with much less discussion.

Help I'm now on call!

Andrew howden — Fri, 28 Jul 2023 06:46:00 GMT

Let’s imagine for a minute that you suddenly find yourself being asked to go “on-call” for a given production service. You don’t quite know what being “on-call” is, except that many senior engineers do it and that it seems necessary. You’re excited that someone asked you to be on-call, but you’re not sure you want to take on that responsibility and are worried about how it will affect your family time.

I have some excellent news: This makes you Perfectly Normal. Being on-call is a weird situation! I’m writing this guide for you, hoping you become more familiar with this responsibility and more ambitious to take it on yourself. It is an excellent way to further your engineering career, learn about production systems, and take responsibility for the customer experience.

I’ve been on call for most of my software engineering career. I’ve gone through many, many iterations: from being “unofficially” on-call (read: always on-call without compensation), to being part of an engineering team that’s set up and is managing a rotation, to designing incident response processes, to being the “Incident Commander” for a billion-dollar European fashion company. I’ve felt almost every part of being on-call, from the “Oh god” moment of breaking production systems and the 3 am “Not this bug again 🤦” drag to the “Oh buddy! Don’t worry, we’ve got this” moment helping a responder recover from their production challenges.

To best prepare for 24x7, we should first understand why an organisation maintains this capability.

TL;DR

All modern software businesses need on-call.
The key to successful on-call is preparation. Figure out your responsibilities, learn the process, figure out your surroundings, understand how to debug and adjust your service in production, understand the significant projects happening and ensure your equipment is prepared.
When the pager goes off, triage the issue. Figure out what is happening through the telemetry you studied earlier, and work with teams around you to figure out an intervention. Try it, and then go on. Be sure to communicate clearly as you’re responding.
When it’s all over, go through the process of learning from your experience. Document the impact, what happened when, a causal tree of anything interesting and then a summary for executive readers.
Through preparation, you’ll be fine. Try it!!

The ever-present service offering

Modern internet-based businesses are expected to be available at all hours of the day. In many cases, the time at which the majority of customers are accessing the service is outside the hours of those who are working on that service. A business might serve the vast majority of customers when no one is looking!

This means the riskiest, highest profit period is often when no one is at their desk! And if something breaks, the business is burning trust with far more customers than might have happened during the day. This is an untenable outcome for many companies; they will not survive burning their customers' trust in this way. Given this, it is essential to have someone tasked with responding to issues as they’re reported. Conversely, if a business's customer group is primarily anchored around business hours, it's often not worth maintaining the 24x7 capability. It might still be worth having an on-call team, but one that operates close to business hours (e.g. 08:00 to 18:00).

In my experience, the most effective on-call teams operate in groups of 5 - 6 people drawn from teams that maintain the application code and the infrastructure definitions of a given application. These team members are usually “senior” and have experience with many different states of the application (good and bad), and can debug an extensive range of failure modes. They often occupy other senior positions, such as writing the architectural guidance of a given application or making tradeoffs around technology choices. They work in shifts, typically for a week and are on-call once every 5 - 6 weeks (in rotation with their colleagues). They carry a dedicated mobile device on a high-quality network during this period. They can be in front of a computer debugging a production issue within 5 - 30 minutes of receiving a notification.
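
As an illustration of the weekly rotation described above, here is a small Python sketch that assigns one engineer per week in round-robin order. The names and start date are made up; real schedules usually live in a paging tool and also handle swaps, holidays and secondary responders.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(engineers, first_shift_start, weeks):
    """Assign one engineer per week, round-robin, starting on the given date."""
    schedule = []
    for week, engineer in enumerate(islice(cycle(engineers), weeks)):
        start = first_shift_start + timedelta(weeks=week)
        end = start + timedelta(days=7)
        schedule.append((start, end, engineer))
    return schedule


team = ["Ana", "Ben", "Chen", "Dara", "Eli"]  # hypothetical five-person team
schedule = weekly_rotation(team, date(2023, 7, 31), weeks=10)
for start, end, engineer in schedule:
    print(f"{start} to {end}: {engineer}")
```

With five people, each engineer is on call one week in five, which matches the cadence above.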

This is a substantial commitment from these colleagues; they should be financially compensated — especially if it is a commitment beyond the normal expectations of software engineering.

> 💡 If you’ve never done on-call, these colleagues can be intimidating. They are usually very experienced colleagues with opinions based on that experience. They can be stubborn about an approach because they ultimately pay the price for poor outcomes. However, it’s important to remember they’re ultimately human. The best way to join them is to try being on-call!

The problem not solved.

Today, we’ll discuss joining a healthy, high-performing team as a new on-call colleague. Setting up a new on-call team is considerably more challenging. Let me know if you’re interested in this in the comments section!

Preparing our gear

We usually think about being on-call as responding to production issues. However, in my experience, the key to being an effective on-call team member is the same as being an effective fire marshal or first aider: preparation. What happens in the incident itself is driven by how much time we spend in practice, not by the skill or intellect of any given responder.

Things that you should look into before you go on call include:

Figure out your responsibilities

There are usually multiple on-call teams in any large organisation (i.e. more than 50 engineers). Each of these teams is responsible for a subset of production systems and works with systems run by other teams. The first thing to look at is what your team is responsible for. There should be a list of the following:

Applications (binaries you have running in production)
Endpoints (DNS endpoints and routes you have exposed)
Business Processes

If you’re joining a team that has been around for a long time, you may have to write or update this list. That’s fine: it’s an opportunity to learn the system's boundaries. You should figure out the stakeholders of each application, endpoint or business process and what the value of that process is to the business. This allows you to determine how urgently to intervene and how much risk you should take with the response.

💁 I have a series of “canned responses” for various failures that I can copy-paste into chat, email or other communication tools as they happen. They’re just stored on disk, and I get them through “cat | xsel --clipboard”. I find these extraordinarily useful to get some critical information out quickly and clearly in a way I otherwise can’t while debugging these systems.
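
If you want placeholders in your canned responses, a variant of the same idea can be sketched in Python with the standard library's `string.Template`. The file name, directory layout and placeholder names here are hypothetical; plain text files piped to the clipboard work just as well.

```python
from pathlib import Path
from string import Template
from tempfile import TemporaryDirectory


def render_canned_response(directory, name, **values):
    """Load a template from disk and fill in the incident-specific values."""
    text = (Path(directory) / f"{name}.txt").read_text(encoding="utf-8")
    return Template(text).substitute(**values)


with TemporaryDirectory() as responses_dir:
    # In practice these files would already exist; one is created here so the
    # sketch runs on its own.
    (Path(responses_dir) / "degraded.txt").write_text(
        "We are investigating degraded $service since $since. "
        "Next update in $minutes minutes.",
        encoding="utf-8",
    )
    message = render_canned_response(
        responses_dir, "degraded",
        service="checkout", since="14:05 UTC", minutes=30,
    )

print(message)
```

`substitute` raises a `KeyError` if a placeholder is left unfilled, which is useful when you are typing under stress.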

Learn the process

Organisations tend to have a clear boundary of separation between things that are “kind of bad” and “incidents”. Incidents are extraordinary — we are encouraged to drop all other work no matter what time of day it is, we can requisition colleagues immediately, we’re prepared to accept more risk deploying changes, and we communicate in ways that might otherwise be considered rude.

We must clearly mark when we are entering “Incident Response” as opposed to any normal part of the software delivery life-cycle. How to do this varies depending on the organisation, but it could be:

The creation of an “Incident” artifact in something like OpsGenie, PagerDuty or Jira
Using a unique phrase (e.g. “This is an incident”) in a chat tool
Creating a new thread in a chat tool in a particular channel, or a channel dedicated to the purpose

Frequently there are “chat ops” tools that make this transition easier. Once you’re in an incident process, there are other tasks that you’ll need to learn how to do to make the process run smoothly. Things like:

Involve another team if you figure out you need their assistance to resolve the issue
Update a status page to let non-technical stakeholders know what the status of the problem is
Notify colleagues of discoveries or interventions to collaborate with them on finding a solution
Mark the problem as repaired once you’ve made the system stable to return to “normal operations”.

Figuring out this stuff beforehand will save you enormous stress in the incident and allow you to quickly find the people you need to help you mitigate an issue.

💁 Recently, I’ve seen quite a few people struggle to interact with the “ChatOps” tools, or if the “ChatOps” tool is unavailable, be unable to manage an incident. This costs us time, which in turn costs us money. Try and get very familiar with these tools ahead of time!

Feel out your surroundings

Once you’ve figured out your responsibilities and how to interact with the process, the next thing to do is figure out the duties of those around you. These include teams that operate services on which your services depend, stakeholders that are likely to reach out if they notice an issue, and people who need to be notified in case something in your responsibility is unavailable.

In particular, you want to know how to communicate with these people in an emergency. If it's another on-call team, figure out what the team is called and what operation you need to do to include them in the response. If it is a stakeholder, figure out how they prefer to receive emergency updates and write a process to deliver it as they expect.

Lastly, please get to know these people and their perspectives. Meet them, and introduce yourself and how you will work during an incident process. Sit with them and review how they view their applications, tooling or function so that you can understand their position during a response.

This will allow you to build a mnemonic or playbook for interacting with them “in anger” that balances what they need with the urgency of the response. This also goes a tremendous way to salve feelings if you inadvertently communicate in a terse (or rude) way during an incident due to stress.

Understand your service in production.

So! You’ve figured out your area of responsibility, the process requirements and who you’ll be working with. The next thing to do is build expertise in the software or infrastructure that’s within the domain you’re responsible for. After all, you’ll need to debug it when something goes wrong!

Some of the ways I’ve done this in the past include:

Review all of the playbooks. Any playbooks written about the service are reviews of previous failures that someone has been kind enough to write guidance for! They are invaluable as both a preparation and response tool.
Review the dashboards, logs and metrics. The telemetric data the application generates is the same data you’ll need to rely on when the system malfunctions. You’ll need to be familiar with it and be able to spot deviations from regular traffic.
Review the configurable aspects of the application. The application configuration is “anything that can be changed outside a deployment” and includes anything from the number of replicas to whether or not a specific feature flag is enabled. You should be able to look at the current configuration, a history of how the configuration has changed, and confidently update values in the configuration.
Deploy the application. Sometimes, when something goes wrong, the only thing that will address it is a change to the source code. Whether this is a rollback to a previous version or a “fix-forward” where you’re writing and merging a patch, deploying the application is a skill you’ll need to be familiar with.
Review previous incidents. Hopefully, as our colleagues have had challenges with our applications, they’ve improved them so we do not have repeat issues. Still, incidents are beautiful opportunities to learn how our software deviates from what we expected. Reviewing past incidents can help us understand where the “sharp edges” of our systems are.

Once you’ve tried these tasks, I encourage you to try and improve them. Try and write a new playbook for a failure mode you anticipate, improve a telemetry view or graph or improve the default for a configuration. As you drive these improvements, your conversations with colleagues will teach you more than just the artifacts.

Understand the significant projects in your area of responsibility.

One of the things that you’ll quickly learn as you go on call is that the majority of issues with production systems are a result of some change that has been recently introduced. This is normal: software can be incredibly complex! It’s challenging to anticipate every change's consequences; the more significant the change, the larger the space for unanticipated consequences.

An excellent way to get ahead of the “likely future failures” is to keep an eye on things that will introduce significant change. Things like:

Substantial changes to customer behaviour
Significant changes to the architecture of the system
Major upgrades of runtimes, libraries, frameworks etc.
Rewrites of a service behind an API

Large projects also tend to run overdue, so engineers are more likely to sacrifice reliability work to get the project delivered. As an on-call team member, your task is not to get in the way of such a release but to empower the team as much as possible to make the release safe. You can go a long way toward safety by ensuring the team has a clear plan to roll back the change if it goes wrong and knows when to invoke it.

Even with the owning team taking a healthy level of responsibility for a change, changes tend to go live during “quiet periods” of the day, and the on-call team members are responsible for that exact change during the busiest, highest impact periods. You will likely be responsible for adjusting or rolling back this change during the response. You should prepare ahead of time by making sure the exit criteria and path are clear enough that you can execute them.

Get your hardware & software prepared

Lastly, for preparation, there are things that we need to have to interact with production systems. These include:

A functioning laptop. Ensure your laptop is charged and all updates are installed before the shift starts.
A functioning phone to receive notifications. As with the laptop, ensure it's charged and updated.
Chargers. Sometimes something goes wrong for long enough that our devices run flat. We need to be able to plug them into a socket.
Network access to production. Unless you’re sleeping at the office, you’ll likely be at home when you receive notifications! You should ensure a stable internet connection and, ideally, a cell-based backup connection.
Permission to access production. At any modern corporation, permissions tend to degrade over time. Either a system will make you “less privileged” or remove your access to some systems entirely. This is a good and healthy implementation of the principle of least privilege, but we need to validate we have sufficient permissions before we go on shift.

If you’re a new colleague, having a checklist for these things that you physically mark off before the shift is good. More experienced colleagues tend to maintain this out of habit — but also, occasionally, forget.

The unexpected adventure

So! The day has come. Hopefully, you prepared as much as possible, but finally, something has broken so severely that it requires emergency intervention. Let’s talk about what to do next.

💓 When you get that page it can be extremely stressful. All the more so if things do not go smoothly; maybe you have the manager asking questions and your teammates disappearing or maybe you’re alone at 04:00.
This is normal. It gets better after a few live incidents, but never entirely goes away. Take a minute to take a breath, compose yourself and then continue. No matter what happens next, you’ll be at your best with a clear(ish) head.

Triage the issue

The first thing to do is understand the issue's scope and magnitude. That lets us decide what level of risk to tolerate when trying to fix it and how many people to wake up or assign to help address it.

Often, the incident process has multiple “levels”. If unsure, pick the more severe one — you can apologise later. Raise the incident at the appropriate “severity”, and communicate your understanding of what users are experiencing.

Intervene

Review the telemetry data from your application and leverage your engineering skills to understand the pathological condition.

The specifics of debugging production depend on the architecture of your system. There are many good heuristic models to understand systems (e.g. USE), but debugging is outside the scope of this article. Instead, what I want to mention is just that you’ll need to go through the OODA loop:

Observe: Figure out what you can understand by reviewing the telemetry from the system.
Orient: Turn that into valuable information by leveraging your knowledge of the system and its context (e.g., is it a sales period or a significant release today).
Decide: Figure out what to do with that information.
Act: Intervene
Repeat

As you’re going through the steps, periodically communicate. The most useful moments are when you’ve learned something new, you’re about to act, or you just acted. When you communicate, communicate statelessly. That means don’t assume that anyone reading your communication has been following along so far, understands the system or follows your hypothesis. As you’re supplying numbers, be specific (e.g. “300ms”) rather than relative (“Huge latency”). Lastly, be kind but direct: be clear in what you’re saying, but don’t spend time on linguistic polish.
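
One way to make stateless communication a habit is to give every update the same self-contained shape. The Python sketch below is only an illustration: the field names, the message layout and the example values are all invented, and a plain text template would serve the same purpose.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """One self-contained incident update; all field names are illustrative."""
    incident: str
    impact: str        # what users are experiencing, with specific numbers
    learned: str       # what we know now
    next_action: str   # what we are about to do, or have just done

    def render(self):
        # Every message repeats the context so a late reader needs no history.
        return (
            f"[{self.incident}] Impact: {self.impact} | "
            f"Known: {self.learned} | Next: {self.next_action}"
        )


update = StatusUpdate(
    incident="Checkout latency",
    impact="p99 latency on 'place order' is 300ms, up from 80ms",
    learned="latency rose after the 14:05 deployment",
    next_action="rolling back the 14:05 deployment now",
)
message = update.render()
print(message)
```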

Lastly, focus your investigation only on how to restore the production experience. While how production broke is an important question, it is a question that’s distinct and less useful than how to restore the production experience.

💁 I’ve seen quite a lot of incidents in which a change has materially changed the performance characteristics of an application. We don’t usually figure that out until peak traffic hits (or a sale). In that case, the question is not “What made the application slow?” but “How far can we scale this application out? What are the bottlenecks?”
The latter is much simpler and will lead you to restore production much faster.

Once you’ve figured out a good intervention, do it (and communicate it). If it doesn’t work, roll it back, share that and try again. If you get stuck or need help from another team, include them!

Review

Hopefully, at this point, you’ve been able to make some changes to the production system so that the customer experience has been restored. You’re not quite sure why the system broke, but it's sufficiently stable that you can breathe.

Now, see if you can figure out how stable the system will be over the next few hours. Your goal now is to buy time to hand the problem back to the owning engineering team so that they can deal with it properly. If you need to intervene further to ensure this stability, do so, and communicate that.

If you’ve made changes to configuration, code or anything else, ensure these changes are reflected in version control. Having someone deploy and undo all your precious recovery doesn't feel great!

Repair

Once you’re confident the system is stable, mark it repaired and communicate that you do not expect further issues. Write down everything you think is relevant in some notes for reviewing the incident, and step away from the keyboard for a bit.

📱Take your phone! While most systems behave after we’ve intervened, sometimes they do not. If they break again, you will be paged again.

Learning and improving

Whew! So far, we’ve gone through preparation and then leveraged that practice to restore our production issue. All done, right?

So, as you might guess, you’re not off the hook just yet. Once the system has reached stability, we need to understand how it got into its failure condition so we can adjust it to be less likely to do that in future or adapt ourselves and our processes to restore the system more quickly. We build this understanding through something called a postmortem. Check out the linked post for details!

In Conclusion

Being on call can be a pretty hectic experience. In this post, we covered the background of being on call, the preparation, what to do in an incident, and what to do after. I’m confident that if you go through this preparation yourselves, you’ll be far more capable of helping improve your users' experience than you would be if you just gave it a go. Hopefully, this post gives you tips to feel more confident as you take that pager.

If not, hit me up in the comments so I can help improve it. Thanks ♥

Running an effective meeting

Andrew howden — Fri, 21 Jul 2023 16:39:15 GMT

One of my colleagues said something incredibly insightful the other day. Something I fully intend to steal and shamelessly reuse:

Routine meetings decay in quality — Lukas Wilhelm

As I’ve gone through my career, I have been part of … many thousands of meetings. Planning meetings, retro meetings, project meetings, 1:1s, and reviews are an invariable necessity of working with our peers.

However, not all meetings are created equal. Some of the meetings I’ve been in have been nothing but a monument to whoever organized them, without a clear purpose or buy-in from other attendees and certainly no clear value delivered at the end. Others have unlocked weeks' worth of work in 30 minutes.

That raises the question: How do we conduct an effective meeting? How do we best use people’s limited time and attention to drive a positive organizational outcome?

The utility of a meeting

If you want to find out the utility of a meeting, you can do so quickly: Ask yourself what you changed due to being in that meeting. Did you adjust your own deliverables? The deliverables of someone else? Did that meeting save time?

As a corollary, if you regularly join a series of meetings and do not adjust your actions, stop joining. You can have that conversation via email, or better yet — not at all.

Unless it’s a coffee. There’s always time for coffee.

The tool for the job

Many different kinds of meetings happen throughout an organization’s lifecycle. For example:

Accountability Meetings. These are usually called “status meetings” or “project check-ins”, but are primarily designed to inform stakeholders about the state of a given project. Invariably, they’re also an opportunity for stakeholders to set expectations regarding where the project should be. This means people tend to work pretty hard just before whenever these meetings are scheduled.
Decision Meetings. These are meetings where an organisational tradeoff needs to be made, usually by a senior leader, who gathers a set of stakeholders together, queries them and then makes that decision.
Kickoff Meetings. These are where there’s a new project, process, team or other change that needs to be managed, and we need to build urgency and a shared understanding of the new expected normal.
Planning Meetings. These are routine meetings that review some work artifacts (e.g., tickets, operational data, performance data) and adjust how to work over the next period.
Retrospectives. These are meetings designed to collaboratively review a team's performance, a project, or whatever and figure out how to evolve the approach going forward.
1:1s. These meetings are designed for context exchange, to check in on how a colleague feels and their morale, and to help shape their specific contributions to a given team.
Coffee / Beer. These meetings are deliberately unstructured and unrecorded, and their topics can range wildly. However, they are often where colleagues determine how to deliver the value of their work, rather than how to complete their assignments.
Presentations. These meetings are mechanisms where a colleague will share their experience.

These meetings are all beneficial. There’s a second class of meetings which I’ll briefly mention, which in my experience are less useful.

Brainstorming. Shared brainstorming is a suitable mechanism for achieving consensus but a bad mechanism for delivering quality insight. Instead, what tends to happen is the group shifts into groupthink as quickly as possible, and what comes out is the mean and not the valuable insight.
Lecture. A meeting doesn’t tend to start out as a lecture. Still, it can shift into one as a participant dominates the conversation to fill the space until no argument is possible, rather than communicating to be listened to.

Each of these meetings has a unique culture and set of expectations. The first step to maximizing the value of a given meeting is to understand what kind of meeting you’re in and to ensure that other participants joining the meeting have the same shared understanding.

The minimum requirements

Some things are common to most meetings, and you can do them almost no matter the occasion. These include:

Assign Responsibility

Within a meeting, some people should have particular responsibilities within the meeting itself. These include:

Organizer. The person who writes and circulates an agenda, ahead of time.
Chair. The person directing the meeting, and who is responsible for ensuring that the agenda is followed, we do not drift too far from the topic, specific individuals cannot monopolize time, and others cannot remain entirely reticent.
Scribe. The person who will note down the actions we need to take as a result of this meeting. This allows us to hold each other accountable in the next meeting, and ensure this meeting actually has value.

A single person generally cannot do all of these roles — they spend the time in the meeting switching between them, and one role (e.g. scribe) can come at the cost of another (e.g. chair).

Prepare

The most crucial step in a successful meeting is preparing for it ahead of time.

To deliver a meeting, there needs to be:

A clear vision. Some reason that it’s worth gathering all of the people together to address a given problem — a common purpose we’re all working toward.
A direction. A path that we take toward our vision, through a set of (pre-made) strategic choices.
Deliverables. Things that specific people need to do by a specific time and are accountable to a specific leader.
Help. A place that people can go when they invariably get lost or drift out of alignment with the group.

A meeting — indeed, any organizational direction setting — without these tends to last only a limited period of time, or descend quickly into in-fighting with people motivated to solve their problems and not the problem of the group.

This is most easily done with an Agenda document that’s prepared early and sent around ahead of time. If it can’t be sent around in time, schedule a 10-minute block at the start of the meeting for all stakeholders to read it.

My preferred Agenda template includes:

A goal describing the purpose of the meeting (or series of meetings)
A section for each meeting, titled with the day / time the meeting happened.
Attendees, so we know who joined this instance of the meeting
Assignments, so we know who’s doing what. In this case, I’m usually both organizing and chairing the meeting so this is not assigned.
Sections. Each section describes what should exist as a “pre-read”, and a section that allows us to add new information as it develops in the meeting.
Action Items. Specific items for specific people.
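
As a rough sketch, the structure above could be written down as a small data model that renders to plain text. The class names, field names and output layout here are illustrative assumptions, not a fixed format, and any shared document with the same parts works just as well.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    pre_read: str  # what attendees should read before the meeting
    notes: list[str] = field(default_factory=list)  # added live, during the meeting


@dataclass
class ActionItem:
    owner: str  # a specific person, not a team
    task: str
    due: str  # e.g. the date of the follow-up meeting


@dataclass
class MeetingInstance:
    when: str  # day / time, used as the heading for this instance
    attendees: list[str]
    assignments: dict[str, str]  # role -> person, e.g. {"Scribe": "Ben"}
    sections: list[Section]
    action_items: list[ActionItem] = field(default_factory=list)


def render_agenda(goal: str, meetings: list[MeetingInstance]) -> str:
    """Render the goal plus one block per meeting instance as plain text."""
    lines = [f"Goal: {goal}", ""]
    for meeting in meetings:
        lines.append(meeting.when)
        lines.append("Attendees: " + ", ".join(meeting.attendees))
        for role, person in meeting.assignments.items():
            lines.append(f"{role}: {person}")
        for section in meeting.sections:
            lines.append(f"Section: {section.title}")
            lines.append(f"  Pre-read: {section.pre_read}")
            for note in section.notes:
                lines.append(f"  Note: {note}")
        for item in meeting.action_items:
            lines.append(f"Action: {item.owner} - {item.task} (by {item.due})")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Keeping every instance of a recurring meeting in one document means the action items from last time sit directly above the section for this time, which makes the follow-up conversation hard to skip.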

Include the right stakeholders

When setting up a meeting it is important to be clear about who is required for that meeting to be a success, and who is optional. As a rule, I target the lowest level decision maker that can confidently address the problem that I have in scope, and I ask them what support capabilities they need within the meeting. I’ll then add both them and whoever else is needed.

Equally critical but less pleasant is who to avoid when scheduling a meeting. There are stakeholders who, while they might have the best of intentions, either do not have the experience in the topic, are ill-informed about the requirements or otherwise derail the conversation.

There’s a last class of people who will in future be part of these meetings. They’re generally on stretch assignments or are being coached to take over. I add them, but inform them ahead of time that the majority of their discussion should be outside this meeting with their “meeting sponsor”.

Follow Up

Earlier, we talked about how the utility of a meeting is the amount of change it drives through an organization. In my experience, this change doesn’t happen unless there's a point in time at which the people responsible for it are held accountable, often in another meeting.

Given this, at the end of this meeting once responsibilities are assigned, schedule the next meeting and set the clear expectation that we need to adjust by this point in time.

In Summary

So far, I’ve not seen an organization that isn’t at least mostly driven through a series of check-in, kickoff or accountability meetings through all layers of management — at least, not a successful one. However, there’s a massive range between meetings that are truly useful and those that … well, in which our colleagues are secretly in chat laughing about something entirely off-topic.

Hopefully, this post has provided you with at least some value in thinking about how to drive these meetings more confidently and certainly more effectively.

Post Script. I wanted to write a bit more here, but ran out of time today. I might try and extend this in future — I’ll just update the post in-place.

The problem of SRE vs DevOps vs ... whatever

Andrew howden — Thu, 13 Jul 2023 10:49:31 GMT

One of the well-intentioned but less valuable conversations that tend to happen around the discipline of “Site Reliability Engineering” is “What’s the difference between a Site Reliability Engineer (SRE) and DevOps?”. There is an abundance of articles with helpful comparisons such as “DevOps use automation tools like Puppet or Chef to ensure consistency vs. SRE which avoids these as they do not scale, instead using languages like Python or Bash” or “DevOps are the (people) writing code vs SRE which are more investigative”. I usually give the facetious answer: “I’m an SRE, as that gets paid more”.

I don’t think any of this comparison is helpful. Instead, I want to provide a different way of looking at the problem that makes us much more accountable for improving our organization’s capacity to deliver a better customer experience.

The things we need to think about are:

Mission
Problem Domains
Team Size & Contribution Models

Mission

As far as I can see, the mission of both “DevOps” and “SRE” still boils down to the same thing: to ensure the customer experience remains reliable and we retain customers’ trust while enabling the business to further innovate on the customer experience.

The Problem

Ensuring the customer experience remains available and retaining customers’ trust requires various capabilities. For example, we need to be able to:

Determine when there’s a customer experience regression. Occasionally this goes under the banner of “Observability”, “Availability”, or “Bounce Rate” — it all boils down to “How many customers are there online now, and are they having a bad time”.
Determine where the performance regression was introduced. This is Observability — the ability to reason through the (usually distributed) system’s internal state utilizing its external outputs. In practice, this means instrumenting the system with logs, traces, and metrics and using whatever active probing tools you need to understand what production is doing.
Respond to that customer experience regression — especially in an emergency. This is usually called “Incident Response”, meaning that an engineer (often with deep system knowledge and specialized training) will jump in and repair it at any given time of day.
Run the software on some computers (generally a public cloud) and expose that software to the world. This is frequently termed “infrastructure”, but in practice means defining what storage, computing, memory and network requirements a given application has and designing a system that facilitates these requirements. Think “Kubernetes” or “Ansible” or “Saltstack” and so on. If you’re really big (or old), this can mean racking and stacking machinery.
Facilitate updates to the software running on that set of computers in a way that allows engineers to push updates to that software without fearing that the update process will result in a customer visible failure and without an absurd amount of work. This is generally “Continuous Integration” and “Continuous Deployment”; if you’re lucky, with a healthy dose of “Feature Flagging” or “A/B Testing”.
Provide mechanisms to restore data that invariably becomes corrupted due to an accident or malicious intent. This includes things like “Point-in-time backups”, “Offsite Backups”, and the rarely invoked “Restoration Process”.
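
To make the first of these capabilities concrete, here is a minimal sketch of asking "are customers having a bad time right now?" by comparing the failure ratio of the current time window against a baseline window. The `Window` type, the function name and the threshold values are all assumptions made for illustration; a real setup would pull these counts from whatever telemetry you already have.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one time window (for example, the last five minutes)."""
    total: int
    failed: int

    @property
    def failure_ratio(self) -> float:
        # An empty window has no failures to report.
        return self.failed / self.total if self.total else 0.0


def is_regression(
    current: Window,
    baseline: Window,
    tolerance: float = 0.01,  # assumed: allow one percentage point of drift
    min_requests: int = 100,  # assumed: below this, the ratio is too noisy
) -> bool:
    """True when the current failure ratio exceeds the baseline by more than `tolerance`."""
    if current.total < min_requests:
        # Too little traffic to judge the ratio. Note that a sudden drop in
        # traffic can itself be a regression; that needs a separate check.
        return False
    return current.failure_ratio - baseline.failure_ratio > tolerance
```

This only answers the "when" question. Working out where the regression was introduced still needs the logs, traces and metrics described above.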

While we need to provide these capabilities within the bounds of “Site Reliability Engineering”, there are many different ways to do so — each with its tradeoffs and whose utility depends on the context in which they’re being applied. However, there’s undoubtedly a typical pattern that is universal (as far as I can see) across all organizations.

The general path to success

Within any given organization (at least, in my experience), the customer-facing software only constitutes a small fraction of the actual “work” needed to make that customer experience available. Instead, large chunks of time go into the work around the work: defining the application scaffolding, the compute definitions, the release process and the telemetry. All of this is required for a reliable, trustworthy customer experience but doesn’t make up the experience itself.

Additionally, that class of work tends to be executed frequently across an organization with only minor adjustments in how it’s executed (e.g. Kubernetes versus ECS, CircleCI vs Drone, Lightstep versus Honeycomb). Fundamentally the problems that face each of these applications (the capabilities we need to expose) remain approximately the same. This work goes into a large bucket Amazon calls “undifferentiated heavy lifting”.

This means our capabilities to contribute to the business are enormous within SRE. We can impact every product, project, software or process in the org! There are several common patterns in the work we do:

Reduce the “Marginal Cost” of a capability

The capabilities the business needs to run come at a cost, usually in engineering time spent rediscovering an independent solution to these problems. Our first and most crucial lever then as SREs is to identify a problem for our product development communities and solve it so it is as cheap for these communities to integrate and use as possible.

This requires that we identify the right problem. Identifying the right problem can be deeply challenging, as it requires us to leave our preconceptions of what constitutes a “good” design behind and instead work directly with the product community to understand their perspective. Once we deeply understand the problem, we can devise a solution that maximizes our future architectural choices while commodifying that capability for our product colleagues.

“Solving product engineering problems as cheaply as possible” sounds extraordinarily unsexy. Still, it’s essential to realize how powerful a change agent commodifying a capability can be. Reducing the marginal cost of text communication allowed a generation to develop a new language over “SMS”; something that still permeates our culture today. Reducing the marginal bandwidth cost allowed Netflix, Google Meet and the home office. Reducing the marginal cost of finding a good restaurant on a Friday night means a much happier life partner and a better date night. Reducing marginal cost is fantastic!

After that, we need people to actually use our capability.

Standardizing the organization on an approach

While we can build (or buy) capabilities we expose to the organization, the organization only derives value from these capabilities if they use them. To that end, we must work with our product colleagues to use our newly built capabilities.

This can be deeply challenging. What tends to happen here is that we’ve spent a lot of time building a new capability without interacting with the organization. Then when we go to the organization to advertise and integrate our capability, we figure out it’s a bad product market fit. It is critical to recognize this and kill the product that exposes the capability or use the feedback to evolve it until it is a good fit. Simply wishing that the org or the engineers were different and “saw the wisdom” of our approach is a path to endless frustration.

Beyond that, standardizing the organization around a capability is the process of driving cultural change. There are many models to do this, but my preferred (Kotter’s) works by:

Creating a sense of urgency
Building a coalition
Forming a strategic vision
Enlisting a volunteer army
Enabling action by removing barriers
Generating short-term wins
Sustaining acceleration
Instituting change

The practical implementation of these change models is one for another article, but they’re a worthy investment of SRE time.

Identifying and building differentiating capabilities

When working within a given organization, there are usually a series of (ever-growing) commodity capabilities such as “CI/CD” or “Cloud”, and there usually isn’t a significant differentiator between one vendor and another (or a vendor versus producing it in-house).

However, there is a limited set of capabilities that either aren’t available on the market or become uniquely valuable to your organization when they’re constructed especially for it.

It’s difficult to say where a “differentiating capability” comes from. It’s usually some insight that’s either only possible in your organization or only apparent to people who’ve had a unique amount of organizational experience and are in a position to shepherd this capability. Half the time, it comes from a conversation after work over a beer. Examples I’ve seen in the past include Adaptive Paging or Meaningful Availability. These capabilities will meaningfully reduce the cost of something required or provide some compelling, valuable new insight that serves as a competitive advantage for the business.

Building a differentiating capability should generally be done in steps. The most successful versions I’ve seen go through:

Proof of Concept. Usually, one developer with a unique insight implements it in a weekend or hack week.
Rallying a Team. The team of which that developer is a part pitches for and receives a (limited) amount of funding to build this new capability. Or they bury it in other projects.
Proof, product development. The capability demonstrates value, and the organization happily endorses continued product work.

The challenging part of a differentiating capability is that from an outside perspective, there’s very little difference between “a differentiating capability that would provide a competitive advantage” and “a wild idea that has no merit”. After all, if everyone understood it, chances are it would have been built!

Given this, these capabilities take (and should take) a lot of work to deliver. They take time, grit and vision.

Team Size & Contribution Models

If we all share the mission of ensuring the customer experience remains reliable, the question becomes how we service the required capabilities: the customer experience measurement, Observability, incident response, deployment, etc.

How we do this depends a bit on our organisational size and culture:

Solo Operator

In a startup (or very small) organization where there’s likely only 1 “SRE” to 5 - 10 engineers, the most valuable way to provide the capabilities is likely a combination of:

Judicious Vendor Selection: By looking at what is in the open market (especially in major cloud providers such as Google Cloud or Amazon Web Services), you’ll be able to expose capabilities for your colleagues that will enable them to solve their problems without relying on you.
Education: Most of your colleagues will not be domain specialists in any of the capabilities we need to solve for. Instead, they’ll probably know much more about … whatever the startup specializes in. You’ll need to help them learn the minimum required to operate services effectively so they can take up a sustainable load and survive if you burn out or leave.
Review: Exposing the capabilities of major providers to untrained colleagues often allows them to prototype their work rapidly, but at a cost. Usually, a cloud bill is discovered way too late in the project lifecycle. You should work to control these costs and ensure teams do not inadvertently attempt something that will, in retrospect, be expensive.
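
One cheap way to catch a cloud bill before the end of the project is to project each team's month-to-date spend forward and compare it against a budget. The sketch below assumes a simple linear projection and hypothetical per-team spend and budget figures that you would export from your provider's billing data yourself; the function names are made up for illustration.

```python
def projected_month_end(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must be between 1 and days_in_month")
    return spend_to_date / day_of_month * days_in_month


def teams_over_budget(
    spend_by_team: dict[str, float],
    budget_by_team: dict[str, float],
    day_of_month: int,
    days_in_month: int,
) -> dict[str, float]:
    """Return the projected spend for every team expected to exceed its budget."""
    flagged = {}
    for team, spend in spend_by_team.items():
        projection = projected_month_end(spend, day_of_month, days_in_month)
        # Assumption: a team with no agreed budget is treated as a budget of zero,
        # so any unplanned spend shows up in the review.
        if projection > budget_by_team.get(team, 0.0):
            flagged[team] = projection
    return flagged
```

A linear projection is crude (prototypes tend to ramp up rather than spend evenly), but it is enough to start the conversation with a team in week two rather than after the invoice arrives.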

Small Team

In a small team (e.g. the standard “two pizza” team), it is possible to build a certain level of redundancy for any given team member. Being able to survive team members leaving means that the SRE team can take responsibility away from the organization and run it solely within the SRE team.

This manifests itself in a few ways:

Prescribed Interventions: An SRE team can provide certain consulting services to an organization designed to leverage the domain expertise of those SREs, which would otherwise be too inefficient to do within teams. These include things like “production readiness” checks, “postmortem reviews”, or “reliability task forces”.
Productizing a Capability: An SRE team can expose a new capability through judicious vendor selection and recommendations or by developing that capability as a proof of concept.
Centralizing Decision Making: Where a single SRE might need to align decisions across the organization with several stakeholders, a team of SREs can provide a “service layer” through the prescribed intervention or by productizing a capability. This layer of abstraction means the SRE team can iterate internally, improving a capability without needing to align outside that group.

The capability of a single SRE team is limited in what it can solve, so it should be targeted at solving the most critical problems an organization has at any given time and letting the rest go away. This characteristic often means that this is an extremely fun — if challenging — time to be an SRE, as you will invariably be exposed to everything, and it’s always high pressure.

Department

As an organization grows, it’ll become more and more cost-efficient to build some capabilities in a sufficiently bespoke or organizationally optimized way—for example, incident response tooling, observability tooling that meets compliance requirements, reporting or infrastructure processes.

At the department level (i.e. 30 people), it is no longer possible to scale an approach based on interpersonal relationships (i.e., the “over beers” management style). Instead, it becomes critically important to articulate a vision of what the SRE department is trying to accomplish and to delegate parts of the execution of that vision to independent teams who are tasked with creating and exposing capabilities. Notably, at the department scale, team members can become more disconnected from their product delivery colleagues and hit the earlier struggle of “standardizing an approach”.

The department level also allows the aggregation of a layer of technical expertise (“staff” or “principal” engineering), which can form the “consultation” arm of such a department. These function as a feedback layer between the teams and the rest of the organization, ensuring we do not drift off course.

Organization

As the organization grows even further, the need to build progressively more capabilities grows in parallel. To scale the approach, we need to further delegate the vision into strategic buckets that each department can own. This requires a broader, less prescriptive vision and a limited set of strategic choices that guide departments in designing their approach.

The scale of the organization also means that it is possible to commodify processes internal to the organisation’s successful functioning, namely the discovery and articulation of product requirements.

Given this, a well-functioning product process with high-level KPIs, a breakdown into the KPI tree, a process for prioritizing work and managing a portfolio of projects are all requirements to successfully commodify the capabilities a large organization needs.

The Name

The question at the end of all this is: What do we call the people tasked with ensuring we remain reliable and retain customers’ trust?

Well, call them whatever you like. In my experience, Site Reliability Engineering is a specialist domain within Software Engineering, like Distributed Systems engineering, Architecture, Developer Productivity, React, Spring and many others. Occasionally, hiring people who specialise in these is more challenging, and it is easier to convince the organization to pay more if we give our prospective colleagues a unique name.

Other than that, call us software developers.

Additional Reading

https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started

Simple, Beautiful Software Development

The value of a software engineer

Understanding each other

Understanding the market

Mapping domain specific expertise into software

The skills required to ship code

The skills required to run code reliably

In Conclusion

Learning Path: Software Engineering

LLM Prompt: Writing Styleguide

Education & Reading Level

Styleguide

Format

Fidelity

Precision

Writing a bug report

Title

Story (or steps to reproduce)

Time and Date

Detailed User Story

Screenshots

Technical Detail (Bonus)

Browser Version

Browser Console

HAR File

Conclusion

Further Reading

Thanks

On communicating through writing

The Fundamental Challenge: Developing an Idea

Communicating an idea

Verbally

With Slides

With Video

In Writing

Point in time reading

Writing Tips

Bonus: Large Language Models

Anatomy of a “Good” commit message

The subject line

The Commit Body

General Background

Stakeholder Impact

Design Notes

Breaking Changes

Making that easy

In Summary

Thanks

Should you go to SRECon?

SRECon

Value

People

Meet new people

Grab LinkedIn

Sharing Knowledge

Submit the Call for Papers

Practice

Talks

Seek Value

Thank the Speaker

Sight Seeing

Challenges

Cost

Learning & Development Budgets

Speaking

Sharing

Time

Exhaustion

In Summary

Suddenly, you’re in charge. How do you create direction?

Suddenly, you're in charge. Now what?

The Problem

SWOT

Story #2: Site Reliability Engineering

The Problem

Success Criteria

organisational Theories

Capabilities & Constraints

Tradeoffs

Constraining Choices