Establishing a shared understanding of extremely urgent, critical business issues in software
A case study of our collaboration with Maschinenring Deutschland GmbH designing the “incident response” process
At Sitewards, our job as a software development agency is to work with our project partners to develop software that helps them meet their business goals more quickly, cheaply and reliably. We do so by developing:
Software Applications
Infrastructure
Monitoring / Analysis tooling
Project workflows
that we tailor to each project partner.
However, despite the care we take with our development, issues occasionally arise that cost our project partners significantly. In those cases we want to help our project partners restore their services as quickly as possible, to limit lost revenue and lost user trust.
Over the last few years, Sitewards and our project partner Maschinenring Deutschland GmbH have developed a framework called “incident response”, since refined and applied to all of our project partners, to balance the need to address issues urgently with the need to maintain a steady and efficient software development process.
Issue critical mass
What counts as a critical issue depends on the nature of our partner’s business. Maschinenring is a membership organisation dedicated to helping farmers support each other. They provide services such as:
In-depth weather information
Special discounts for new cars and bulk purchase of petrol
Connecting farmers with opportunities to lease their equipment
Connecting farmers with other industry partners for services such as snow clearing
Emergency help
Given these services, the following things are treated as “critical”:
Any interruption of the availability or accuracy of the “weather” page
Any issue with members being able to access member services
Instance unavailability
However, what is and is not an incident is ultimately the domain of Maschinenring itself. In one particular case, an issue that seemed small in terms of the software was a critical problem for Maschinenring: the diesel order form.
The diesel order form failure
On May 5th, 2017 we received a message from our project partners that the diesel order page did not work correctly. At first it wasn’t clear that this was a critical issue; in terms of the software, the diesel order form was thought to be a small, simple and fairly unimportant component. However, our project partners quickly corrected our misunderstanding: the diesel form had significant user value and needed to be fixed as quickly as possible.
An incident is born
Once we understood the criticality of the issue, an incident was declared:
Declaring an incident is a process that signifies that something is happening that overrides the importance of every other process in the company. Meetings are suspended, team members called back from lunch, coffees emptied (or refilled) and a team of experts assembled from across the company with the singular aim of restoring the broken component.
The team is broken down into several roles:
Incident coordinator: A team member who ensures the process is followed and decisions are documented
Operations: Team members tasked only with investigating the issue and finding temporary fixes
Communications: Team members who establish a two-way dialog with project owners and customers
This team works according to the incident response process, coordinating via a central communication channel (usually Slack) until a temporary workaround is found to restore the service.
Once the system is stable in production, the majority of the incident response team disbands, leaving only the incident coordinator to write the post mortem.
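The lifecycle described above can be sketched as a small state machine. This is a minimal illustration, not Sitewards’ actual tooling; the class and role names are assumptions chosen to mirror the roles listed earlier.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class IncidentState(Enum):
    DECLARED = "declared"    # overrides every other process in the company
    MITIGATED = "mitigated"  # a temporary workaround is live in production
    CLOSED = "closed"        # the post mortem has been written


@dataclass
class Incident:
    summary: str
    coordinator: str  # ensures the process is followed and documented
    operations: list = field(default_factory=list)      # investigate, find workarounds
    communications: list = field(default_factory=list)  # dialog with owners/customers
    declared_at: datetime = field(default_factory=datetime.now)
    state: IncidentState = IncidentState.DECLARED

    def mitigate(self) -> None:
        """A temporary fix restores the service; most of the team disbands."""
        self.state = IncidentState.MITIGATED
        self.operations.clear()
        self.communications.clear()

    def close(self) -> None:
        """Only the coordinator remains to write the post mortem and close out."""
        assert self.state is IncidentState.MITIGATED
        self.state = IncidentState.CLOSED
```

In practice the transitions would be driven by people in a chat channel rather than code; the sketch only makes the ordering explicit: an incident cannot be closed until the service has first been restored.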
In this case, the incident was declared at 14:49 and resolved for users by 15:14: a total of ~25 minutes.
Learning lessons
An incident is, by definition, an unplanned emergency. These emergencies cost a significant amount of money:
Service unavailability means users cannot derive value from the service, and unhappy users mean lost revenue for the business
Incident response is an “all hands on deck” approach: it resolves the issue quickly, at the cost of pulling many expert developers away from their work
Accordingly, the goal is to reduce such emergencies to a more appropriate rate of roughly one per three months. More frequent means the system is unstable and the development team cannot work efficiently; less frequent means the developers are being too cautious and perhaps leaving opportunities unfulfilled.
To ensure that incidents are reduced to an acceptable frequency, and to decide how much should be invested in preventing such issues in future, each incident is studied in depth in a document called a “post mortem”. This document has the goal of:
Surfacing application behaviour that did not function as anticipated
Surfacing technical, organisational or economic patterns that encourage critical failures
Surfacing high risk areas of the application
Each contributing factor is recorded as a bug in the bug tracker and must be followed up by the normal project team: either dismissed explicitly, or addressed with a solution that prevents the issue in future. In the case of this incident, the follow-up tasks were:
Adding the missing test data for the diesel feature
Including the diesel feature in “smoke” tests
Improving our ability to surface critical issues
Resolving the bug associated with the fields
Meetings are scheduled every four weeks until all factors are marked as either “fixed” or “wontfix”.
Putting incident response into practice
Designing the incident response process has been a collaboration between Sitewards and our project partners. Since its initial design in October 2017, we’ve refined the process to improve communication, allow easier follow-up and ensure that we’re as transparent as possible in our investigation and resolution. It has been used successfully to diagnose and remediate:
Issues with the “Weather” service
Hardware failures
Database overload
Cascading failure
Login failures
Application bugs introduced during development
Following the implementation of incident response and the required follow-up, critical issues were resolved significantly faster than before, and after an initially high count the number of incidents fell significantly.
The system is now more stable, developers are more aware of the business-critical functionality, and our development process has been modified to systematically reduce the risk of further outages. The incident response process has grown from its initial implementation between Sitewards and Maschinenring into an organisation-wide process for handling emergencies.
Lastly, the team at Maschinenring knows they can get an immediate response to an issue when required, simply by invoking the incident response process.
In Conclusion
At Sitewards, we take pride in our software development and continually try to ensure that the software we build is the best we can make it within the constraints each project provides. Unfortunately, even the most carefully written software can behave in unpredictable ways. However, Sitewards and Maschinenring have been able to establish a process for escalating critical issues such that they get the attention they deserve and disruption to users is kept to an absolute minimum. This process has been in place for approximately two years and has demonstrably reduced the time required to resolve critical production issues, as well as contributing to the ongoing stability and predictability of our software.
Additional References
(Website): Report from SNAFUcatchers workshop on coping with complexity
(Video): John Allspaw on keeping systems running day after day
(Website): PagerDuty incident response documentation
Thanks
Marcus & the team at Maschinenring for their feedback on the incident response process
Antonius & the WAT team at Sitewards for helping design and implement the process
Sitewards for their broad adoption of the process