Help, I'm now on call!
What to do if you find yourself responsible for production systems and you're not entirely sure what that means
Let’s imagine for a minute that you suddenly find yourself being asked to go “on-call” for a given production service. You don’t quite know what being “on-call” is, except that many senior engineers do it, which seems necessary. You’re excited that someone asked you to be on-call, but you’re not sure you want to take on that responsibility and are worried about how it will affect your family time.
I have some excellent news: This makes you Perfectly Normal. Being on-call is a weird situation! I’m writing this guide for you, hoping you become more familiar with this responsibility and more eager to take it on. It is an excellent way to further your engineering career, learn about production systems, and take responsibility for the customer experience.
I’ve been on call for most of my software engineering career. I’ve gone through many, many iterations — from being “unofficially” on-call (read: always on-call, without compensation), to being part of an engineering team setting up and managing a rotation, to designing incident response processes, to being the “Incident Commander” for a billion-dollar European fashion company. I’ve felt almost every part of being on-call: from the “Oh god” moment of breaking production systems and the 3 am “Not this bug again 🤦” drag, to the “Oh buddy! Don’t worry, we’ve got this” moment of helping a responder recover from their production challenges.
To best prepare for a 24x7 rotation, we should first understand why an organisation maintains this capability.
All modern software businesses need on-call.
The key to successful on-call is preparation. Figure out your responsibilities, learn the process, feel out your surroundings, understand how to debug and adjust your service in production, understand the significant projects happening, and ensure your equipment is prepared.
When the pager goes off, triage the issue. Figure out what is happening through the telemetry you studied earlier, and work with teams around you to figure out an intervention. Try it, and then go on. Be sure to communicate clearly as you’re responding.
When it’s all over, go through the process of learning from your experience. Document the impact, a timeline of what happened when, a causal tree of anything interesting, and then a summary for executive readers.
Through preparation, you’ll be fine. Try it!!
You read the TL;DR! That’s basically the critical bit. Want more TL;DRs? Subscribe now!
The ever-present service offering
Modern internet-based businesses are expected to be available at all hours of the day. In many cases, the time at which the majority of customers are accessing the service is outside the hours of those who are working on that service. A business might serve the vast majority of customers when no one is looking!
This means the riskiest, highest-profit period is often when no one is at their desk! And if something breaks, the business is burning trust with far more customers than it would during the day. This is an untenable outcome for many companies; they will not survive burning their customers’ trust in this way. Given this, it is essential to have someone tasked with responding to issues as they’re reported. Conversely, if a business’s customer group is primarily anchored around business hours, it’s often not worth maintaining the full 24x7 capability. It might still be worth having an on-call team, but one that operates close to business hours (e.g. 08:00–18:00).
In my experience, the most effective on-call teams operate in groups of 5–6 people drawn from the teams that maintain the application code and the infrastructure definitions of a given application. These team members are usually “senior”, have experience with many different states of the application (good and bad), and can debug an extensive range of failure modes. They often occupy other senior positions, such as writing the architectural guidance for a given application or making tradeoffs around technology choices. They work in shifts, typically a week at a time, and are on-call once every 5–6 weeks (in rotation with their colleagues). During this period they carry a dedicated mobile device on a high-quality network, and they can be in front of a computer debugging a production issue within 5–30 minutes of receiving a notification.
This is a substantial commitment from these colleagues; they should be financially compensated — especially if it is a commitment beyond the normal expectations of software engineering.
> 💡 If you’ve never done on-call, these colleagues can be intimidating. They are usually very experienced colleagues with opinions based on that experience. They can be stubborn about an approach because they ultimately pay the price for poor outcomes. However, it’s important to remember they’re ultimately human. The best way to join them is to try being on-call!
The problem we won’t solve here.
Today, we’ll discuss joining a healthy, high-performing team as a new on-call colleague. Setting up a new on-call team is considerably more challenging. Let me know if you’re interested in this in the comments section!
Preparing our gear
We usually think about being on-call as responding to production issues. However, in my experience, the key to being an effective on-call team member is the same as being an effective fire marshal or first aider: preparation. What happens in the incident itself is driven by how much time we spent preparing, not by the skill or intellect of any given responder.
Things that you should look into before you go on call include:
Figure out your responsibilities
There are usually multiple on-call teams in any large organisation (i.e. one with more than ~50 engineers). Each of these teams is responsible for a subset of production systems and works with systems run by other teams. The first thing to look at is what your team is responsible for. There should be a list of the following:
Applications (binaries you have running in production)
Endpoints (DNS endpoints and routes you have exposed)
Business processes (the customer-facing journeys those applications and endpoints support)
If you’re joining a long-established team, you may have to write or update this list. That’s fine; it’s an opportunity to learn the system’s boundaries. You should figure out the stakeholders of each application, endpoint or business process, and what the value of that process is to the business. This allows you to determine how urgently to intervene and how much risk you can take with the response.
💁 I have a series of “canned responses” for various failures that I can copy-paste into chat, email or other communication tools as they happen. They’re just stored on disk, and I copy them with “xsel --clipboard < response.txt”. I find them extraordinarily useful for getting critical information out quickly and clearly in a way I otherwise can’t while debugging these systems.
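As a sketch of how this might look in practice (the directory layout, file name and wording are all hypothetical, and `xsel` assumes an X11 clipboard; macOS users would use `pbcopy` instead):

```shell
# Hypothetical layout: one plain-text file per anticipated failure mode.
mkdir -p "$HOME/oncall/canned"
cat > "$HOME/oncall/canned/checkout-degraded.txt" <<'EOF'
We are seeing degraded performance on checkout.
Impact: elevated latency for a subset of users (exact numbers to follow).
We are investigating; next update in 15 minutes.
EOF

# During an incident, copy one straight onto the clipboard
# (requires an X session; swallow the error if one isn't available):
xsel --clipboard < "$HOME/oncall/canned/checkout-degraded.txt" 2>/dev/null || true
```

The point is that the message is written calmly, ahead of time, so the version you paste at 3 am is still clear.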
Learn the process
Organisations tend to have a clear boundary of separation between things that are “kind of bad” and “incidents”. Incidents are extraordinary — we are encouraged to drop all other work no matter what time of day it is, we can requisition colleagues immediately, we’re prepared to accept more risk deploying changes, and we communicate in ways that might otherwise be considered rude.
We must clearly delimit entering “Incident Response” from the normal software delivery life-cycle. How to do this varies depending on the organisation, but it could be:
The creation of an “Incident” artifact in something like OpsGenie, PagerDuty or Jira
Using a unique phrase (e.g. “This is an incident”) in a chat tool
Creating a new thread in a chat tool in a particular channel, or a channel dedicated to the purpose
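As a hedged illustration of the first option, a sketch of declaring an incident via PagerDuty’s Events API v2 might look like this. The routing key, summary and source are placeholders; your organisation’s tooling and fields will differ:

```shell
# Build the event payload (PagerDuty Events API v2 shape; routing key is a placeholder).
cat > /tmp/incident-event.json <<'EOF'
{
  "routing_key": "YOUR_ROUTING_KEY",
  "event_action": "trigger",
  "payload": {
    "summary": "Checkout error rate above 5% for 10 minutes",
    "severity": "critical",
    "source": "checkout-service"
  }
}
EOF

# Send it (commented out here because it needs a real routing key):
# curl -s -X POST https://events.pagerduty.com/v2/enqueue \
#   -H 'Content-Type: application/json' \
#   -d @/tmp/incident-event.json
```

Whatever the mechanism, the goal is the same: one unambiguous action that switches everyone into the incident process.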
Frequently there are “chat ops” tools that make this transition easier. Once you’re in an incident process, there are other tasks that you’ll need to learn how to do to make the process run smoothly. Things like:
Involving another team if you figure out you need their assistance to resolve the issue
Updating a status page to let non-technical stakeholders know the current status of the problem
Notifying colleagues of discoveries or interventions, to collaborate with them on finding a solution
Marking the problem as repaired once you’ve made the system stable, to return to “normal operations”
Figuring out this stuff beforehand will save you enormous stress in the incident and allow you to quickly find the people you need to help you mitigate an issue.
💁 Recently, I’ve seen quite a few people struggle to interact with the “ChatOps” tools, or be unable to manage an incident at all when the “ChatOps” tool itself is unavailable. This costs us time, which in turn costs us money. Try to get very familiar with these tools ahead of time!
Feel out your surroundings
Once you’ve figured out your responsibilities and how to interact with the process, the next thing to do is understand those around you: the teams that operate services your services depend on, the stakeholders likely to reach out if they notice an issue, and the people who need to be notified if something in your area of responsibility is unavailable.
In particular, you want to know how to communicate with these people in an emergency. If it's another on-call team, figure out what the team is called and what operation you need to do to include them in the response. If it is a stakeholder, figure out how they prefer to receive emergency updates and write a process to deliver it as they expect.
Lastly, please get to know these people and their perspectives. Meet them, and introduce yourself and how you will work during an incident process. Sit with them and review how they view their applications, tooling or function so that you can understand their position during a response.
This will allow you to build a mnemonic or playbook for interacting with them “in anger”, one that balances what they need with the urgency of the response. It also goes a tremendous way toward soothing feelings if stress leads you to communicate in a terse (or rude) way during an incident.
Understand your service in production.
So! You’ve figured out your area of responsibility, the process requirements and who you’ll be working with. The next thing to do is build expertise in the software or infrastructure that’s within the domain you’re responsible for. After all, you’ll need to debug it when something goes wrong!
Some of the ways I’ve done this in the past include:
Review all of the playbooks. Any playbook written about the service is a record of a previous failure that someone has been kind enough to write guidance for! They are invaluable as both preparation and response tools.
Review the dashboards, logs and metrics. The telemetric data the application generates is the same data you’ll need to rely on when the system malfunctions. You’ll need to be familiar with it and be able to spot deviations from regular traffic.
Review the configurable aspects of the application. The application configuration is “anything that can be changed outside a deployment” and includes anything from the number of replicas to whether or not a specific feature flag is enabled. You should be able to look at the current configuration, a history of how the configuration has changed, and confidently update values in the configuration.
Deploy the application. Sometimes, when something goes wrong, the only thing that will address it is a change to the source code. Whether this is a rollback to a previous version or a “fix-forward” where you’re writing and merging a patch, deploying the application is a skill you’ll need to be familiar with.
Review previous incidents. Hopefully, as our colleagues have had challenges with our applications, they’ve improved it so we do not have repeat issues. Still, incidents are beautiful opportunities to learn how our software deviates from what we expected. Reviewing past incidents can help us understand where the “sharp edges” of our systems are.
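As one concrete (and hypothetical) example of the configuration point: many teams keep configuration in version control, a “GitOps” style setup. Assuming that, viewing its history and changing it with an audit trail might look like this sketch; the repo path, file name and values are all made up:

```shell
# Hypothetical config repo holding the replica count for a service.
mkdir -p /tmp/app-config && cd /tmp/app-config && git init -q
echo "replicas: 3" > checkout-deploy.yaml
git add checkout-deploy.yaml
git -c user.name=oncall -c user.email=oncall@example.com commit -qm "baseline config"

# See how the configuration has changed over time:
git log --oneline -- checkout-deploy.yaml

# Scale out during an incident, leaving a clear audit trail:
sed -i 's/replicas: 3/replicas: 6/' checkout-deploy.yaml
git -c user.name=oncall -c user.email=oncall@example.com commit -qam "scale checkout out during incident"
```

In a real setup, a deployment system would apply the committed change; the habit worth building is that every production adjustment lands in version control with a message explaining why.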
Once you’ve tried these tasks, I encourage you to improve on them. Write a new playbook for a failure mode you anticipate, improve a telemetry view or graph, or improve the default for a configuration value. As you drive these improvements, the conversations with colleagues will teach you more than the artifacts alone.
Understand the significant projects in your area of responsibility.
One of the things that you’ll quickly learn as you go on call is the majority of issues with production systems are a result of some change that has been recently introduced. This is normal — software can be incredibly complex! It’s challenging to anticipate every change's consequences; the more significant the change, the larger the space for unanticipated consequences.
An excellent way to get ahead of the “likely future failures” is to keep an eye on things that will introduce significant change. Things like:
Substantial changes to customer behaviour
Significant changes to the system’s architecture
Major upgrades of runtimes, libraries, frameworks etc.
Rewrites of a service behind an API
Large projects also tend to run overdue, so engineers are more likely to sacrifice reliability work to get the project delivered. As an on-call team member, your task is not to get in the way of such a release but to empower the team as much as possible to make the release safe. You can go a long way toward safety by ensuring the team has a clear plan to roll back the change if it goes wrong and when to invoke this rollback plan.
Even with the owning team taking a healthy level of responsibility for a change, changes tend to go live during the “quiet periods” of the day, which leaves the on-call team responsible for that exact change during the busiest, highest-impact periods. You will likely be responsible for adjusting or rolling back this change during the response. You should prepare ahead of time by making sure the exit criteria and rollback path are clear enough that you can execute them.
Get your hardware & software prepared
Lastly, for preparation, there are some things we physically need in order to interact with production systems. These include:
A functioning laptop. Ensure your laptop is charged and all updates are installed before the shift starts.
A functioning phone to receive notifications. As with the laptop, ensure it's charged and updated.
Chargers. Sometimes something goes wrong for long enough that our devices run flat. We need to be able to plug them into a socket.
Network access to production. Unless you’re sleeping at the office, you’ll likely be at home when you receive notifications! You should ensure a stable internet connection and, ideally, a cell-based backup connection.
Permission to access production. At any modern corporation, permissions tend to degrade over time. Either a system will make you “less privileged” or remove your access to some systems entirely. This is a good and healthy implementation of the principle of least privilege, but we need to validate we have sufficient permissions before we go on shift.
If you’re a new colleague, having a checklist for these things that you physically mark off before the shift is good. More experienced colleagues tend to maintain this out of habit — but also, occasionally, forget.
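That checklist can even be a small script you run before each shift. This is only a sketch: the two live checks are deliberately generic, and the commented lines are placeholders you would replace with your own dashboard, VPN and production endpoints:

```shell
#!/bin/sh
# Pre-shift checklist: prints ok/FAIL per item; $fail records the overall result.
fail=0
check() {
  if "$@" >/dev/null 2>&1; then
    echo "ok:   $*"
  else
    echo "FAIL: $*"
    fail=1
  fi
}

check test -w /tmp        # can write local notes
check command -v sh       # required tooling present
# Environment-specific checks -- uncomment and adapt to your stack:
# check curl -fsS https://grafana.example.internal/api/health   # dashboards reachable
# check kubectl auth can-i get pods -n production               # prod permissions still valid

echo "$fail" > /tmp/preflight.fail   # 0 means ready for shift
```

Running it takes seconds, and it catches the quietly expired VPN certificate or revoked permission before the pager does.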
The unexpected adventure
So! The day has come. Hopefully, you prepared as much as possible, but finally, something has broken so severely that it requires emergency intervention. Let’s talk about what to do next.
💓 When you get that page, it can be extremely stressful. All the more so if things do not go smoothly; maybe a manager is asking questions and your teammates have disappeared, or maybe you’re alone at 04:00.
This is normal. It gets better after a few live incidents, but never entirely goes away. Take a minute to take a breath, compose yourself and then continue. No matter what happens next, you’ll be at your best with a clear(ish) head.
Triage the issue
The first thing to do is understand the issue's scope and magnitude. That lets us decide what level of risk to tolerate when trying to fix it and how many people to wake up or assign to help address it.
Often, the incident process has multiple “levels”. If unsure, pick the more severe one — you can apologise later. Raise the incident at the appropriate “severity”, and communicate your understanding of what users are experiencing.
Review the telemetry data from your application and leverage your engineering skills to understand the pathological condition.
The specifics of debugging production depend on the architecture of your system. There are many good heuristic models to understand systems (e.g. USE), but debugging is outside the scope of this article. Instead, what I want to mention is just that you’ll need to go through the OODA loop:
Observe: Figure out what you can understand by reviewing the telemetry from the system.
Orient: Turn that into valuable information by leveraging your knowledge of the system and its context (e.g., is it a sales period, or is there a significant release today?).
Decide: Figure out what to do with that information.
Act: Apply the intervention you decided on, then loop back to observe its effect.
As you’re going through these steps, communicate periodically. The most useful moments are when you’ve learned something new, when you’re about to act, and just after you’ve acted. When you communicate, communicate statelessly: don’t assume that anyone reading has been following along so far, understands the system, or follows your hypothesis. When you supply numbers, be specific (e.g. “300ms”) rather than relative (“huge latency”). Lastly, be kind but direct: be clear in what you’re saying, but don’t spend time on linguistic polish.
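To make “stateless” concrete, here is what such an update might look like, written from a skeleton so each message restates the full context. Every name, number and timestamp below is invented for illustration:

```shell
# Fill in a stateless update from a skeleton; tee keeps a local copy for the postmortem.
tee /tmp/incident-update.txt <<'EOF'
[14:32 UTC] INCIDENT -- checkout latency (ongoing)
Impact: p99 latency at 3200ms (normally ~300ms); roughly 20% of checkouts affected
Last action: rolled back deploy v2.41 at 14:25 UTC; no improvement observed
Next step: scaling checkout-service from 6 to 12 replicas
Next update: 14:45 UTC
EOF
```

Note that a reader who has seen nothing else still learns the impact, the last action, the next step, and when to expect more.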
Above all, focus your investigation only on how to restore the production experience. How production broke is an important question, but it is distinct from, and less urgent than, how to restore service.
💁 I’ve seen quite a lot of incidents in which a change materially changed the performance characteristics of an application, and we didn’t find out until peak traffic (or a sale) hit. In that case, the question is not “What made the application slow?” but “How far can we scale this application out? Where are the bottlenecks?”
The latter is much simpler and will lead you to restore production much faster.
Once you’ve figured out a good intervention, do it (and communicate it). If it doesn’t work, roll it back, share that and try again. If you get stuck or need help from another team, include them!
Hopefully, at this point, you’ve been able to make some changes to the production system so that the customer experience has been restored. You’re not quite sure why the system broke, but it's sufficiently stable that you can breathe.
Now, see if you can figure out how stable the system will be over the next few hours. Your goal now is to buy time to hand the problem back to the owning engineering team so that they can deal with it properly. If you need to intervene further to ensure this stability, do so, and communicate that.
If you’ve made changes to configuration, code or anything else, ensure these changes are reflected in version control. Having someone else deploy over and undo all your precious recovery work doesn’t feel great!
Once you’re confident the system is stable, mark it repaired and communicate that you do not expect further issues. Write down everything you think is relevant in some notes for reviewing the incident, and step away from the keyboard for a bit.
📱Take your phone! While most systems behave after we’ve intervened, sometimes they do not. If they break again, you will be paged again.
Learning and improving
Whew! So far, we’ve gone through preparation and then leveraged that preparation to resolve our production issue. Are we all done?
So, as you might guess, you’re not off the hook just yet. Once the system has reached stability, we need to understand how it got into its failure condition, so we can either adjust it to be less likely to fail that way in future, or adapt ourselves and our processes to restore the system more quickly. We build this understanding through something called a postmortem. Check out the linked post for details!
Being on call can be a pretty hectic experience. In this post, we covered the background of being on call, the preparation, what to do in an incident, and what to do after. I’m confident that if you go through this preparation yourself, you’ll be far more capable of helping improve your users’ experience than if you just gave it a go. Hopefully, this post gives you tips to feel more confident as you take that pager.
If not, hit me up in the comments so I can help improve it. Thanks ♥
Gosh, that was a lot! If you’ve made it this far, congratulations 😀 Maybe you’re really enjoying these posts. Or maybe you need to be on-call soon. Either way, subscribe for more helpful stuff!