This blog lays out the pitfalls of exemplars and explains a better approach to linking distributed traces and metrics.
On: Oct 26, 2021
Since we announced support for ingesting distributed traces into our platform, one of the questions we’ve gotten is whether we use exemplars to link metric data to traces. While our solution does provide deep linking between metrics and traces, we do not rely on exemplars to achieve this. This post explains why by looking at the limitations of exemplars as an approach to linking metrics and traces.
Exemplars are references to data outside of the metrics published by an application. Alongside each metric it publishes, the application also publishes a reference to some other data (e.g., a traceID) that relates to what is being measured. In practice, this gives developers an easy way to jump from a metric that represents thousands of requests to a distributed trace that provides context on what those requests may actually look like. Prometheus added experimental support for writing exemplars to remote endpoints in its v2.27 release in May 2021, but the concept has been around for longer as part of the OpenMetrics specification.
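To make this concrete, here is a minimal sketch of what a metric sample carrying an exemplar looks like in the OpenMetrics text format, where the exemplar follows the sample value after a `#`. The helper `format_sample_with_exemplar`, the metric name, and the trace ID are illustrative assumptions, not taken from any particular client library, and the sketch skips label-value escaping for brevity.

```python
def format_sample_with_exemplar(name, labels, value,
                                exemplar_labels, exemplar_value,
                                exemplar_ts=None):
    """Render one OpenMetrics-style sample line with an attached exemplar.

    Hypothetical helper for illustration only; real client libraries
    handle escaping and validation that this sketch omits.
    """
    def render(pairs):
        # Render a label set as {k1="v1",k2="v2"}
        return "{" + ",".join(f'{k}="{v}"' for k, v in pairs.items()) + "}"

    # Layout: <metric>{<labels>} <value> # {<exemplar labels>} <exemplar value> [<timestamp>]
    line = f"{name}{render(labels)} {value} # {render(exemplar_labels)} {exemplar_value}"
    if exemplar_ts is not None:
        line += f" {exemplar_ts}"
    return line


# A histogram bucket counting 1027 requests, with ONE of those requests
# (the one that took 0.43s) referenced by its trace ID.
line = format_sample_with_exemplar(
    "auth_request_duration_seconds_bucket",  # assumed metric name
    {"le": "0.5"},
    1027,
    {"trace_id": "4bf92f3577b34da6"},        # assumed trace ID
    0.43,
)
print(line)
```

Note that the bucket summarizes 1027 requests while the exemplar points at exactly one of them, which is the property the rest of this post examines.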
Exemplars give us an easy way to jump from metrics to a relevant distributed trace by publishing metrics with TraceIDs as exemplars. However, there are two major limitations to this approach:
Let’s look at an example of this problem in practice: Suppose you are alerted that your authentication service for a mobile application is seeing a spike in errors. You go to the relevant dashboard and click on an exemplar around the time of the spike, and the distributed trace shows you an authentication request from an Android user has timed out. That’s probably more information than we had looking at the dashboard, but there are a lot of important questions we might want to ask that this exemplar trace cannot answer on its own:
In order to answer any of the above questions, you’ll have to navigate over to your tracing tool. This is what you would do even without exemplars, assuming your tracing tool allows you to perform the types of analysis required to answer them. If it does not, then your last resort is to examine exemplar traces one by one, the equivalent of sorting through individual pieces of hay until you find your needle.
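As a rough illustration of the difference between inspecting one exemplar and analyzing traces in aggregate, the sketch below computes error rates across a set of traces grouped by a chosen attribute. The trace records and the `platform`, `app_version`, and `error` fields are made-up sample data for this example, and `error_rate_by` is a hypothetical helper rather than part of any tracing tool.

```python
from collections import Counter


def error_rate_by(traces, attribute):
    """Return the fraction of failed traces for each value of `attribute`."""
    totals, errors = Counter(), Counter()
    for trace in traces:
        key = trace.get(attribute, "unknown")
        totals[key] += 1
        if trace["error"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical sample: six authentication traces around the time of the spike.
traces = [
    {"platform": "android", "app_version": "2.1", "error": True},
    {"platform": "android", "app_version": "2.1", "error": True},
    {"platform": "android", "app_version": "2.0", "error": False},
    {"platform": "ios", "app_version": "2.1", "error": True},
    {"platform": "ios", "app_version": "2.0", "error": False},
    {"platform": "ios", "app_version": "2.0", "error": False},
]

by_platform = error_rate_by(traces, "platform")
by_version = error_rate_by(traces, "app_version")
print(by_platform)
print(by_version)
```

In this invented data set, a single exemplar from a failing Android request would suggest an Android problem, while the grouped view shows that every failure comes from app version 2.1 regardless of platform. Only analysis across all traces surfaces that kind of pattern.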
To be clear, exemplars CAN be useful and are certainly an improvement over the siloed state of metrics and distributed traces that many organizations live with today. At Chronosphere, we believe that we CAN and MUST do better so that our users can get the full value of distributed tracing.
For users to realize the potential of metrics and distributed traces combined, they need the freedom to jump from looking at metrics on a dashboard to analyzing all of their distributed traces for interesting patterns across the relevant dimensions. With Chronosphere, we not only make this possible, but also make answering the types of questions posed in our example an automatic process. Chronosphere users spend less time poring over individual traces for suspicious behavior and following prescriptive workflows, and more time addressing the problem they came to triage and understand.
Request a demo for an in-depth walkthrough of the platform!