I hereby declare this an incident
A magical incantation that instantly grants a team access to the entire disposal of Sitewards expertise to resolve an emergency, but at a cost.
I’ve been a software developer for ~ 5 years now, give or take, currently poking e-commerce stores at Sitewards. In that time I’ve had a progressively increasing level of responsibility from “Can you make this pink?” to “We need to be able to deploy a couple of times per day without risk or downtime to the client”. Unfortunately, not all of that experience has been smooth sailing.
From memory, a short list of things that I have messed up, or that have been messed up while I've been on watch:
Deleting the entire directory tree of a machine (rm -rf . /*)
Permanently corrupting and deleting user-supplied data during an upgrade
Finding “unexpected access” to a machine
Machines filling their disks and refusing to write to disk, then refusing to boot
Machines failing to reboot after unattended upgrades
Services failing due to resource exhaustion, then cascading into wider failure
All of these events end up costing someone a significant amount of money. Whether that's revenue lost directly while the service is unavailable, lost trust between the customers and the service owner, or lost trust between the service owner and the development team — all of the above are expensive issues that need to be sorted out as quickly as possible.
As the teams I've worked with have progressed through the years, we've gotten steadily better at dealing with these emergency situations. They're a fact of life, and as the projects that we touch become larger and more successful, having solutions to these problems readily available becomes more and more necessary. In this post, I hope to explore the problems with these situations and the tools that we've put in place to help recover.
Boom! An explosion roars in the distance. Panicked clients start calling.
Consider this hypothetical scenario: A customer is experiencing an unexpectedly high traffic load due to a tweet from a celebrity about how good the product was. The client notices the rapid and unexpected placement of orders and celebrates — this unexpected windfall will mean easily fulfilled quotas, increased budgets next quarter and perhaps bonuses for the team!
Aaaand then the service dies. The service stops responding to customer requests, and they’re left impatient and, after hitting refresh a couple of times, simply abandon their purchases. Alerts start surfacing to the developers who specialise in operations, and the customer calls in a panic — they’re watching their unexpected windfall turn into a disaster and there is nothing they can do.
This explosion causes a cascade of problems within the development team:
Interruption
The nature of development work means that interruptions are extremely costly. While developing, we must create a sort of mental model of how the machinery of the service works — tracking how data flows through the service, creating new abstract objects that represent the data at that time and restrict how it behaves, and finally compiling those abstractions together into some sort of response.
It's kind of like immersing yourself in a complex book. It takes an hour to settle into, and when you're interrupted you can never simply return to that immersion — you must start again, perhaps rereading pages you had already read before falling back into the story. We put a cost of ~40 minutes on each interruption, in addition to any time spent putting out fires.
An incident means continued and regular interruptions for several members of our development team. It is a suspension of product work to resolve an issue, and it is super damaging to velocity.
Knowledge
The very nature of production problems means that a condition has arisen that was not predicted during the development lifecycle. Either some assumption the developers made was incorrect and must be teased apart, or the system is being used in a fundamentally new way — perhaps simply by handling an order-of-magnitude increase in traffic.
This means that those who built the system are sometimes in a poor position to understand the nature of the failure but are immediately called upon to “do something” to help the service get back online.
Stress
Few things are as infectious as panic. The panicked customer will immediately call the project manager, who in turn realises that a customer who has experienced large, unexpected losses will be unlikely to approve any new budget for the project, leaving the project team out of work. They express this to the development team, who immediately understand that lots of people are upset, and that the inevitable micromanagement resulting from customers demanding insight into how the issue occurred will cause untold misery.
Perversely, such panic kills the creative thinking necessary to quickly rectify the sorts of issues that appear in production. Stress focuses us like a laser, and we immediately check the things that we know. But, by nature, the issue is in something that we hadn’t counted on!
Authority
In many cases those who build the systems do not have full access to those systems in a production environment. This is a tricky balance — too little access and they are unable to introspect the application in its actual run-time environment; too much access and we risk exposing customer data, either directly to developers who have no need to see it, or to the additional risk of a developer's machine being compromised.
Requisitioning authority can take time; by nature we do not want to increase someone's access to a system simply because something happened. This time prolongs recovery, burning money and increasing pressure.
Coordination
In a sense, we're lucky to work in an environment in which, when things go bad, everyone steps in to help. However, in highly complex environments this can backfire.
Consider the aforementioned example — the service suddenly stops working. In this case, the service is in a cascading failure; too many requests exhausted a resource, and the failure handling of other parts of the system is putting even more strain on the overloaded component.
Now, consider two developers, each attempting to fix the issue:
Alice, who modifies some webserver configuration, and
Bob, who modifies some MySQL configuration
Suddenly, the service is alive again. But why is it alive? Who fixed it, and how? In this case, neither change fixed the issue. The server restart required by the configuration change dropped all connections, providing a temporary respite in server load.
Shortly, the service will again fail, and neither Alice nor Bob will be able to understand why.
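As an aside, a quick check makes this kind of "mystery recovery" visible. This is a rough sketch only, assuming shell access to the database host and MySQL credentials already configured (for example in ~/.my.cnf); the five-second interval is arbitrary:
watch -n 5 "mysql -e \"SHOW STATUS LIKE 'Threads_connected'\""
# If the connection count climbs straight back towards max_connections after
# the webserver restart, the restart only dropped connections; the underlying
# exhaustion is untouched and the service will fail again.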
An unclear coordination path increases the complexity of issues beyond their (usually fairly simple) immediate causes. It prolongs incident response, and makes forming a clear resolution process essentially impossible.
Accountability
Lastly, in many cases, one of our more experienced development team members will step in, review the issue, do something quite magical and get the service back up and running.
Hooray! A sigh of relief floods through the team. The service is back up, the project manager and customer are happy and the experienced team member is a hero.
However, the issue isn't actually resolved. It is likely to occur again — in this system or another. And over time it's more likely to occur after the experienced team member has left or is on holiday, when there will be no hero to save the system again.
Declaring an incident
The incident response process is designed to help mitigate the above issues. It consists of several parts:
The initial declaration by any member of the team
A series of checklists designed to ensure that often-overlooked items, such as notifying the customer or establishing centralised communications, are not missed
A centralised place to discuss the issue (usually Slack) that documents all changes; a sketch of how little tooling this requires follows the list
Assignment of specific responsibilities, and restrictions on stepping outside those responsibilities
A published document called a “post mortem” that investigates the issue in a very structured way. It includes creating tasks to help address problems discovered during the incident.
A meeting scheduled 30 days after the incident to verify that the issues identified are resolved. If they’re not resolved, the meeting is scheduled every 30 days thereafter until they are.
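As promised above, a rough illustration of how little tooling the centralised channel needs. The webhook URL below is a placeholder; we happen to use Slack, but any chat tool with incoming webhooks works the same way:
#!/usr/bin/env bash
# declare-incident.sh -- post the declaration into the incident channel via a
# Slack incoming webhook. The URL is a placeholder; each workspace issues its own.
WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"
SUMMARY="$1"   # e.g. "checkout returning 500s on production"
curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\": \"I hereby declare this an incident: ${SUMMARY}\"}" \
  "$WEBHOOK_URL"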
What is an incident?
There is no formal definition of an incident; any member of the team can make that judgement call themselves. The incident process is inherently expensive — it's optimised to restore service extremely quickly, at the cost of interrupting at least four members of the team (usually more) and many hours of investigative work.
However, a rough heuristic is that “if it affects users it can be declared an incident”. Some examples include:
The checkout is broken
The service is entirely inaccessible
Unauthorized access is discovered on a node
A process that is corrupting data is identified
The above are all things that are going to cost someone a substantial amount of money — they need to be sorted out right now.
Declare, Delegate, Document
An incident begins with the specific phrase:
“I hereby declare this an incident”.
At first, it sounded ludicrous! But it turns out that when people are not sure whether there's an emergency, they hesitate to use language forceful enough to convey the urgent nature of the problem. This eats into recovery time — time spent simply clarifying whether emergency action is required for the emergency that is happening.
Compare:
“The disk is full on www1!”
to
“I hereby declare this an incident”
In the case of the former, a project manager is going to think "… uhh, what?" In the latter, the much more accurate "… ahh fuck, let me find ${CTO}" will flash across their mind.
Immediately following the declaration, the person who declared the incident must assign people to the roles defined below. This happens before any other action, as the roles being assigned are also time-sensitive and must be actioned as quickly as possible.
In the case of a developer discovering an issue, they must stop investigating and assign roles before they are allowed to continue. Otherwise their investigation will not be coordinated, the client will not be informed or perhaps the issue will be buried.
Incident Command
Incident command is the "default" role: whoever declares the incident is incident command. Their responsibilities include:
Creating a centralised place for people to coordinate around the issue
Assigning the other roles required to assist in the issue resolution
Summarising the issue in the centralised communication channel
Closing the incident once service has been reliably restored
Preparing the post mortem for review by the team
Enforcing attendance at the review meeting following the incident closure
At the time of the incident, the incident command has authority over almost all other areas of the business.
Incident command cannot perform the tasks of other roles. This includes operations — one cannot be in incident command and investigate or directly communicate with customers.
The incident command role is usually filled by a senior developer with experience of being part of incidents. This provides both the perspective required to understand the issue and the authority required to requisition the things needed to resolve it.
Operations
Operations is the simplest role to define: investigate and resolve the issue. There may be many operators; however, they must coordinate their actions through incident command and document the measures they take in the centralised communications channel.
Communications
Among the most corrosive effects of an emergency is the loss of trust between customers and the service owner, and between the service owner and us as the development team.
The role of communications is to follow an established pattern of notifying all relevant parties of the issue. This includes:
Notifying the service owner that we have declared the incident and are responding
Notifying the customers that we understand there is an issue and are working on it
Notifying internal teams that their help may be shortly required and to be on standby
Continually updating all relevant parties until the issue is resolved
The communications role is critical to ensuring that the recovery process is as transparent as possible, and that we assure all involved that we are taking the issue as seriously as the structures above would imply.
Simply setting expectations up front mitigates some of the corrosive loss of trust that will otherwise manifest.
Planning
Once the incident has been declared closed by the incident command, the planning role is required to integrate the issues that were identified during the post mortem into the development workflow.
Post Mortem
At the conclusion of an incident a document called a “post mortem” is prepared. This is a structured document detailing:
An abstract of what happened
Who was involved
A timeline of the events
A review of how the incident affected stakeholders of the service
A summary of the issues discovered during the incident
Links to the resolutions in the associated bug tracker
Attached emails/screenshots/other evidence useful for understanding the issue.
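In practice, it helps to start from a skeleton. A rough sketch of scaffolding one follows; the file name and section layout are simply one way to begin, not a prescribed format:
# Create an empty post mortem with the sections listed above.
cat > post-mortem-$(date +%F).md <<'EOF'
# Post Mortem: <one-line abstract>
## People (referred to by role in the body: Incident Command, Operations A, ...)
## Timeline
## Stakeholder impact
## Issues discovered (each linked to the bug tracker)
## Evidence (emails, screenshots, logs)
EOF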
It is primarily used as a teaching document. Incidents are invariably things that do not happen on a daily basis, and things that we did not anticipate. Accordingly, the post mortem is designed to help us fold back into the project teams the knowledge required to predict, resolve and prevent this type of issue from happening again. Additionally, it's used as the primary reference documentation for the incident, and would be useful in the case of legal proceedings, disciplinary actions or any other consequential process.
The document is unusual in that it takes a structured approach in a couple of ways:
It’s blameless
We essentially start with the idea that all who work with us are well-intentioned, good people. As covered in the next section, the goal is not to find a single root cause (be it a person or event) but rather to have a frank and honest conversation about the issues that led up to the incident. These issues can be wide-ranging; while they are usually technical, they are also commonly issues around the allocation of budget, motivation within the team or communication failures.
The post mortem does not refer to people directly, but rather substitutes variables for each person (Operations person A, Incident Command, Planning, Service Owner). Though the variables are associated with actual names in the "People" section of the post mortem, distancing the issues from the people removes incentives to assume blame, take credit or otherwise assign more agency than necessary to those involved in the incident.
That’s not to say that it removes accountability. Indeed, part of the post mortem and the review meetings are designed to ensure that the issues discovered are tracked and resolved, and that issues cannot be “buried”. Responsibility for ensuring the issues are closed sits with the planning role, and more broadly on the team responsible for the project.
It’s designed to avoid a root cause
During an incident there is often a single technical solution that makes the difference between having a service running or not. In the case of the most recent cascading failure, reducing the number of web workers prevented the MySQL connection pool from filling up and stopped applications continually polling to reconnect.
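The arithmetic behind that kind of fix is simple. As a hedged sketch, assuming a fleet of PHP-FPM web hosts in front of a single MySQL instance (typical for our stacks, but not universal; the numbers and the config path are illustrative only):
# Worst case, every worker on every host holds a connection, so:
#   web hosts x pm.max_children per host  <  MySQL max_connections
#   e.g. 4 hosts x 50 workers = 200 potential connections
mysql -e "SHOW VARIABLES LIKE 'max_connections'"
grep 'pm.max_children' /etc/php/*/fpm/pool.d/www.conf   # path varies by distro and PHP version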
However, it is rare that there is a single technical problem. More often there is a series of missteps in managing the technical implementation, time, budget and people that leads to the issue. The post mortem is designed to pull as many of these to the surface as possible, where they can be tracked in the issue tracker and resolved during the course of normal work.
In this way we can avoid not only a recurrence of this specific issue, but perhaps a whole class of issues across this and other projects.
Review
Amusingly, immediately following an incident's closure, any incentive to make permanent fixes for the issues discovered goes away. Once the pain is gone, other issues start gnawing at our ankles and we are dragged off into other things — the issues identified during the post mortem eventually become stale, never getting resolved.
The review process is designed to keep the pain of the incident fresh, and to keep the team responsible for the project accountable for fixing the issues identified in the post mortem. By being clear at the start of the incident that the issues identified will need to be resolved (or the meetings will continue), we can ensure that project teams are incentivised to resolve the issues in a more permanent way.
The review process consists of a short meeting to discuss the issue as it was identified, determine whether all of the required issues were fixed (or marked as “wontfix”) and that the incident can be closed.
Closure
Once the incident command is comfortable that the issues identified have been resolved, the incident is closed. The incident team celebrates, and moves back to project work. Post mortems are again distributed throughout the company so that other teams can learn from the failure.
In Conclusion
Having an incident response process initially seemed like unnecessary overhead. It's expensive, requires a strict set of procedures to be obeyed, and doesn't allow the flexibility in solving problems that agile organisations usually value.
However, the structure gives the incident team a sense of security as they investigate, relieving enough of the pressure that they are able to think creatively about how to solve the issues. It forces us to communicate with all parties in a way that ensures every stakeholder has a clear understanding of the issue, and it provides the feedback loops necessary to ensure that once incidents are closed, the issues identified are not buried only to be discovered by another unfortunate team on another day.
So, a magical phrase that lets the development team get the things they need immediately — but at a cost.
Thanks
Haroon Meer, whose critique of these processes inspired me to document the above. We'd suffered some of the same issues he mentioned, and I wanted to share how we deal with them.
Antonius Koch, who read a draft and inspired me to do final editing and publish.