Transaction Tracing in Magento 2
The vast majority of my development work is dominated by an application called “Magento”, specifically “Magento 2”. Magento is famous for…
The vast majority of my development work is dominated by an application called “Magento”, specifically “Magento 2”. Magento is famous for it’s community outreach programs, one of which some of our nicer colleagues at Comwrap put on.
The day essentially amounts to tackling issues on Magento 2. Honestly, an odd premise — working on an open source project for free, for … reasons. However, I found it a good excuse to catch up with a former college and talk shit, and surprisingly, it was quite fun.
Let’s do a thing
The Magento community engineering team has projects setup that allow multiple developers to coordinate together to ship a feature. The comwrap team is shipping a particularly exciting feature called “Async API’ — An API that allows deferring the expensive validation and write of an API until after the application has accepted the API payload.
However, while this will surely be handy, it’s not a project that I found particularly exciting. Instead, I wanted to take a look at something called “Transaction Traces”. The team welcomed the idea, and I got started.
Transaction Tracing
Transaction tracing arises out of the problem of tracking a request through multiple services. Given any given user facing services, an application may execute as many as 100 different services, threads, RPC calls etc.
Transaction tracing is, as it sounds, the ability to track a request through all of these different applications and determine from it performance information such as “how long did an application spend blocked, waiting for MySQL” or “How long did an application calculate the 1000’s digit of pi”. This allows a wholistic view of the entire stack of an application, rather than simply a per-application slice.
Such tracing is reportedly very useful for debugging complex system failures, in which services much be otherwise profiled separately and blocking resources perhaps not apparent.
Magento 2 Profiler
Magento 2 has for many years included profiling as part of it’s core application stack. However, consuming the information prepared by this tooling is often more overhead than it would be worth, providing considerable additional pain as compared to other tools such as the Valgrind out generated by XDebug and consumed by KCacheGrind.
However, with the advent of standard transaction tracing tooling and tools that help analyse these kinds of profiling outputs, it makes more sense to reevaluate the Magento profiler to determine if it provides additional context that may not be available in the Valgrind output analysis.
Additionally, the Magento profiler offers the promising possibility to trace transactions across arbitrary application borders, without relying on the code representation or callgraph views that valgrind outputs require.
OpenCensus
Tracing has been around in some sort of design at least as far as the initial publication of Zipkin by Twitter in 2012. However, there are few “standard” approaches to transaction tracing, nor are there are common libraries that are in use to express transactions to the various systems.
Currently, there are two main “competitors” to become the defacto standard:
Both record tracing in their own abstract format, but allow expressing it to an arbitrary number. However, OpenCensus seems advantageous in a couple of ways:
It’s sponsored by Google, and the engineers who are working on this project are specialists in their field. Google monetize this through their StackDriver transaction tracing tooling.
It includes time series data additionally, which allows storing two types of diagnostic data with one library and set of APIs
There’s a native PHP extension being developed
Googles penchant for killing projects arbitrary aside, OpenCensus appears to be a promising candidate for long term investment.
Implementation
The goal for this work is to try and stub out the Magento 2 profiler with transaction traces. There should be minimal work required to do this, as they’re both … kindof of the same. However, as it turns out, it’s not super easy.
Magento already separate out the profiling into two separate abstractions:
Driver (something like a “sampler”)
Output (something like a “renderer”)
There are other projects that take the existing output and modify it such that it renders pretty graphs, however many of the sampling libraries that exist both to sample and export the content. In order to take maximal advantage of these libraries (and any implementations in the PHP runtime itself that they may offer), the design quickly became sub out the driver itself as well as the exporter.
Shortly, this ran into problems. Magneto already allows users to stub out different formats with the profiler.flag
file, even allowing them to inject in arbitrary third party classes. However, they do not allow easily substituting the driver itself.
To that end, a new pull request was created that allowed substituting even the driver in a minimally invasive way, taking advantage of existing properties of the application.
Once this pull request was created it became relatively easy to swap out the driver for the OpenCensus implementation. Though the docs are a little opaque (indeed, I’m not quite sure whether I implemented things correctly) I was able to both implement a primitive which correctly seems to sample the application, and allow arbitrary configurable exporters.
Results
Well, see below:
In the end, though I tried a couple of different viewers I settled on Jaeger as it’s owned by the CNCF project (I think) and it’s quite pretty. Zipkin, though similar, wasn’t a nice to work with.
Given the above it’s quite easy to spot some of the following performance hotspots:
Things are not being cached. There’s a larger number of spans, and the spans go for longer.
Hotspots in the application. Matching routers takes a surprisingly long time (see above)
What’s happening on a super granular level across spans.
In Conclusion
I’m not 100% sure whether this will provide sufficient value by itself. I think more work will be required; particularly I’d love to see:
Spans propagating to requests downstream
More specific profiling of client applications — Redis, et. al
Comparisons in spans
I think this will likely be a result of my learning about transaction tracing, rather than a limitation of the APIs. Additionally, OpenCensus is still in alpha — it’ll be a while before this reaches prod.
Thanks
David Manners
The Comwrap team
Lyzun Oleksandr, who encouraged me particularly to explore this in the time allotted (I worked pretty hard!)
All others who participated in the event.
Caveats
This got yolo published. I haven’t edited it, and will probably do so in subsequent days if people read it.