Explainers: Instrumenting your service
Explainer (noun): a statement, article, or video that provides an explanation of a concept, topic, situation, etc.
At Sitewards, we routinely develop and deploy internet-facing web applications. We manage roughly 50 different web properties of one kind or another, all of which are live and serving users.
Unfortunately, not all of them work perfectly all of the time. Keeping everything in perfect functioning order is a Sisyphean task. There are a variety of reasons why things go wrong:
People use the site in ways that we did not anticipate
More people than we planned for use the site and it runs out of capacity
Some underlying piece of software or infrastructure fails
Diagnosing and fixing these issues in a timely way can be tricky, especially when there are so many applications: far more than we could ever review manually.
Luckily, we’ve gotten pretty good at instrumenting our applications so that problems are automatically picked up and surfaced to us!
Instrumentation (noun): a collective term for measuring instruments used for indicating, measuring and recording physical quantities; when using many sensors it may become part of a complex industrial control system, such as in manufacturing, vehicles and transportation.
When we deploy an application, we do not just deploy it and pray it works. Instead, we collect a whole swathe of information from it on an ongoing basis. This information is modified to preserve the privacy of users and then stored for a period of time for later analysis.
There are a few different types of instrumentation, but the most common in use here are:
Logs
Time series data
Transaction traces
Logs
Generally speaking, the applications that we administer are many thousands of lines long. One common application, Magento 2, is over 800,000 lines of code! Throughout this program’s life it makes many thousands of decisions.
When something is not working as expected this can usually be tracked down to either a decision being made incorrectly, or something flawed in the way the decision is written.
We get insight into how the application is running by noting the “significant” decisions or errors an application makes in a text file, along with the time they occurred and other relevant information. These can then be investigated when an issue is reported in the application, giving us the retroactive ability to look back in time at how the application performed and understand the scope of the flaw.
Additionally, the logs from many applications can be collected together and compared, giving us an excellent picture of how the application is performing in production.
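As an illustration, here is a minimal sketch of this kind of logging in Python. The language, the logger name and the order fields are all illustrative assumptions, not taken from any particular application we run:

```python
import io
import logging

# Sketch: every "significant" decision or error is recorded with a
# timestamp and a severity, so it can be reviewed after the fact.
buffer = io.StringIO()  # stands in for the text file on disk
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("shop")  # hypothetical application logger
log.setLevel(logging.INFO)
log.addHandler(handler)

def charge_customer(order_id, amount):
    # order_id and amount are invented fields for the example.
    if amount <= 0:
        log.error("order %s rejected: non-positive amount %s", order_id, amount)
        return False
    log.info("order %s charged %s", order_id, amount)
    return True
```

Each entry then carries enough context (time, severity, identifiers) to reconstruct what the application decided and when.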
An example log is below:
cd6dc4d9688a449e8d27a1ef9e375ea1 - - [16/Mar/2018:13:43:00 +0000] "GET /en/shop/product-detail/?getspecpdf=1 HTTP/1.0" 200 550267 "-" "-"
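To show how such a line can be analysed, here is a sketch that parses it with a regular expression. The field names are my own invention; note that the first field is an anonymised client identifier rather than an IP address:

```python
import re

# A pattern for the (anonymised) access log line shown above,
# which follows the common/combined log format.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

line = ('cd6dc4d9688a449e8d27a1ef9e375ea1 - - [16/Mar/2018:13:43:00 +0000] '
        '"GET /en/shop/product-detail/?getspecpdf=1 HTTP/1.0" 200 550267 "-" "-"')

match = LOG_PATTERN.match(line)
record = match.groupdict()  # e.g. record["status"] is "200"
```

Once parsed, lines like this can be filtered, counted and compared across many applications.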
Time Series Data
Unfortunately, it is difficult to get a general sense of how an application is performing from logs alone. For example, if an application is very busy, simply logging:
“Hey I am very busy!”
is not going to provide us with much in the way of information. Instead, there is a whole swathe of information that is better represented numerically. That way we can determine not only whether the application is busy, but also how busy it is. Generally this is represented on a scale between 0 and 100, where 0 is “totally idle” and 100 is “oh god”.
All sorts of things can be represented as numbers:
How many requests a website has had
How many of those requests failed
How much capacity an application has used
Additionally, these numbers can be represented as a graph. The example below is one I took from a particularly troublesome machine today:
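As a sketch of how such counters might be collected, here is a tiny in-process metrics store in Python. Real deployments use dedicated time-series tooling; the counter names and the sampling approach here are purely illustrative:

```python
import time
from collections import defaultdict

# Counters track running totals; sample() periodically copies them out
# as (timestamp, value) points, which is the shape a time-series
# database stores and graphs.
counters = defaultdict(int)
series = defaultdict(list)

def inc(name, amount=1):
    counters[name] += amount

def sample():
    now = time.time()
    for name, value in counters.items():
        series[name].append((now, value))

# Simulate three requests, one of which failed.
for _ in range(3):
    inc("requests_total")
inc("requests_failed")
sample()
```

Plotting those sampled points over time is exactly what produces graphs like the one above.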
Time series data is also unique in that it allows us to put specific bounds on what is “acceptable” behaviour. For example, we can write the rule:
When any application no longer has any space to write to its data storage, send a notification to the developers’ pager indicating that something is badly wrong with this application and requires immediate attention.
Though we try to keep these alerts to a minimum, when they do go off we are able to act quickly to preserve the application and its data, and to minimise any time the application is unavailable.
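A rule like that can be sketched in a few lines of Python using only the standard library. The 90% threshold and the notify callback are illustrative assumptions, not our actual alerting configuration:

```python
import shutil

# Illustrative threshold: alert when a filesystem is 90% full.
DISK_ALERT_THRESHOLD = 0.90

def disk_usage_fraction(path="/"):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def check_disk(path="/", threshold=DISK_ALERT_THRESHOLD, notify=print):
    # notify is a stand-in for a real pager integration.
    fraction = disk_usage_fraction(path)
    if fraction >= threshold:
        notify(f"disk at {fraction:.0%} on {path}: immediate attention required")
        return True
    return False
```

In practice a monitoring system evaluates rules like this continuously against the collected time series, rather than polling the disk directly.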
Transaction Traces
Transaction traces are perhaps the trickiest instrumentation method to understand.
Broadly, when a user makes a request of a website it will travel through a large chain of decisions before returning the response to the user. Each request and response is sometimes called a “transaction”.
So, a transaction trace is a snapshot of that entire request, including how long it spent in each particular decision as well as the final decision of the request. This detailed analysis allows us to quickly find “hot spots” where applications are spending an unacceptably long time trying to make a decision.
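A much simplified sketch of this idea in Python: each named step of a pretend request records its duration, so the slowest step stands out afterwards. The step names and timings are invented for the example:

```python
import time
from contextlib import contextmanager

# Each (name, duration) pair is one step of the "transaction".
trace = []

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append((name, time.perf_counter() - start))

# A pretend request made of three steps, one deliberately slow.
with span("load_session"):
    time.sleep(0.001)
with span("query_products"):
    time.sleep(0.05)
with span("render_page"):
    time.sleep(0.001)

slowest = max(trace, key=lambda item: item[1])
```

Real tracing tools nest spans and follow requests across services, but the principle is the same: measure each decision, then look for the hot spots.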
Other methods, used only on an ad-hoc basis or less commonly, include instrumenting the interface between the application and the kernel that operates the computer machinery (system calls), capturing and reviewing content as it moves across the network (packet capture), and inspecting application state or core dumps (perf/gdb).
It’s somewhat hard to grasp in the abstract how this data helps us diagnose issues, so let’s look at some examples:
Core bug in Magento 2 checkout
Given the log below:
Mar 15 12:04:15 -nginx: 2018/03/15 12:04:15 [error] 1400#1400: *5761427 FastCGI sent in stderr: "PHP message: PHP Fatal error: Uncaught TypeError: Argument 1 passed to Magento\Quote\Model\Cart\Totals::setExtensionAttributes() must be an instance of Magento\Quote\Api\Data\TotalsExtensionInterface, instance of Magento\Quote\Api\Data\AddressExtension given, called in /path/to/app/vendor/magento/framework/Api/DataObjectHelper.php on line 125 and defined in /path/to/app/vendor/magento/module-quote/Model/Cart/Totals.php:592
We can quickly determine that our application is occasionally acting improperly. This was lodged with Magento 2, and was fixed … sort of. They missed it in the last release!
Practically, this means that during the checkout some users are getting an error. This is not an error they can do anything about, nor one they can fix by, for example, refreshing the page.
The disk is full
With one of the recent Magento updates, a change was made to how images are processed. This was to prevent a security issue in the application and seemed like an excellent idea. However, it interacted badly with a process designed to import thousands of images. Because Magento needed to reprocess images, every image had to be downloaded twice. This quickly filled the disk.
However, a rule that a thoughtful colleague had inserted earlier flagged the disk when it was 93% full. The process was paused, and the error identified and resolved.
Practically, a full disk has all kinds of weird data-corruption effects on the machine, the most common of which is simply overwriting existing data with nothing, permanently destroying it. In this case, we avoided that.
CPU maxed out
During a periodic review of all of our systems, we noticed that one system was under a large amount of load. We did some investigating, and noticed that the load only started a couple of days ago, and while it was not affecting users, this was strange enough that we investigated further.
It turns out that because of a requirement unique to this system, another system was making a large number of requests against it. These requests were expensive, and the machine struggled to cope with them.
The response times of the application are still within the required levels of performance, so we’re not too worried about this one at the moment. However, we’ve sent an email to the operators of the third party system, and we’ll keep an eye on it going forward.
Practically, this means that if the website gets busier, then it may fall apart in a couple of ways. Users will visit the site and be left waiting for a response that the server is too busy to give!
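The kind of check that flags such a machine can be sketched with Python’s standard library. These are Unix-only APIs, and the threshold of 1.0 load per CPU is an illustrative rule of thumb, not our production configuration:

```python
import os

def load_per_cpu():
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages
    # (Unix only); dividing by the CPU count normalises across machines.
    one_minute, _, _ = os.getloadavg()
    return one_minute / os.cpu_count()

def cpu_saturated(threshold=1.0):
    # Above roughly 1.0 load per CPU, work queues faster than it is served.
    return load_per_cpu() >= threshold
```

A monitoring agent samples a value like this every few seconds and records it as time series data, which is how the sudden rise in load was spotted.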
Machine got hacked
Perhaps most excitingly, our automated tooling picked up a nefarious user breaking into a third party machine. Now, I need to stress here that we picked up and remediated the hack extremely quickly. Further, the nature of the hack meant that no personal data was exposed. Still, we were well prepared for such an eventuality, and after performing a thorough analysis we discovered and plugged the leak and rebuilt all of the foundations of that system from scratch.
This sobering lesson started from our instrumentation picking up an unscheduled reboot, violating a condition associated with the collected time series data. Routine investigation showed some discrepancies in the data following the reboot, and we discovered the bad actor.
Practically, this preserved users’ privacy by allowing us to kill the hack before unauthorized people were able to access user data. Additionally, it may have saved the client millions of dollars in fines, legal fees and lost business due to a loss of consumer confidence.
Whether for good or bad, software does not always do as it’s supposed to. However, with some forethought we can gather a large amount of information to help us debug issues as they happen, and restore the application to its proper functioning as quickly as possible.
Thanks to Antonius Koch for his review and feedback.