What are exemplars for distributed traces and metrics?
Since we announced support for ingesting distributed traces into our platform, one of the questions we’ve been asked is whether we use exemplars to link metric data to traces. While our solution does provide deep linking between metrics and traces, we do not rely on exemplars to achieve this. This post explains why, by examining the limitations of exemplars as an approach to linking metrics and traces.
Exemplars are references to data outside of the metrics themselves: alongside the metrics it publishes, an application also publishes a reference to some other data (for example, a trace ID) that relates to what is being measured. In practice, this gives developers an easy way to jump from a metric that represents thousands of requests to a distributed trace that provides context on what those requests may actually look like. Prometheus added experimental support for writing exemplars to remote endpoints in its v2.27 release back in May 2021, but the concept has been around longer as part of the OpenMetrics specification.
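To make this concrete, here’s a minimal sketch of what an exemplar looks like on the wire. The sample line follows the OpenMetrics exposition format, where everything after the ` # ` marker is the exemplar (a label set carrying the trace ID, an exemplar value, and an optional timestamp); the metric name, label values, and trace ID here are invented for illustration:

```python
import re

# An OpenMetrics counter sample with an exemplar attached. Per the
# OpenMetrics spec, everything after " # " is the exemplar: a label set
# (here carrying a trace ID), an exemplar value, and a timestamp.
line = 'http_requests_total{code="500"} 1027 # {trace_id="KOO5S4vxi0o"} 1.0 1608520832'

# Split the plain sample from its exemplar and pull out the linked trace ID.
sample, _, exemplar = line.partition(" # ")
match = re.search(r'trace_id="([^"]+)"', exemplar)
print(sample)                              # the plain metric sample
print(match.group(1) if match else None)   # the trace ID a UI would link to
```

A dashboard that supports exemplars does essentially this: it reads the trace ID off the sample and renders it as a link into the tracing tool.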
The shortfalls of exemplars
Exemplars give us an easy way to jump from metrics to a relevant distributed trace by publishing metrics with trace IDs as exemplars. However, this approach has two major limitations:
- As the name exemplar implies, you are looking at one example. This becomes a significant pitfall because the problems that require distributed tracing to triage and understand are multifaceted — you need to be able to analyze a population of distributed traces and surface differences to truly understand what’s going on.
- It’s up to the developer to decide which trace ID is most “relevant” to publish as an exemplar. This additional step ends up being an exercise in futility because, as noted above, all but the simplest problems require analyzing many traces at once to get the full picture, rather than trying to pull a single needle from the haystack, so to speak.
A real-world example of how exemplars don’t give you the answers you need
Let’s look at an example of this problem in practice: suppose you are alerted that the authentication service for your mobile application is seeing a spike in errors. You go to the relevant dashboard and click on an exemplar around the time of the spike, and the distributed trace shows you that an authentication request from an Android user timed out. That’s probably more information than you had looking at the dashboard, but there are a lot of important questions you might want to ask that this exemplar trace cannot answer on its own:
- Is the spike in errors related only to Android users, or are iOS users affected as well?
- Similarly, does this affect all versions of Android? Perhaps it’s only a subset.
- Is there any correlation between client geography and errors?
- What about the error itself? Our exemplar trace shows a timeout – is that the cause of the spike though, or are there other errors being seen by clients at the same time?
- Also, if request timeouts are the issue, how does our exemplar’s behavior compare to other requests that do not time out? Is there a particular phase of its execution that has degraded?
In order to answer any of the above questions, you’ll have to navigate over to your tracing tool. This is what you would do even without exemplars, assuming your tracing tool allows you to perform the types of analysis required to answer them. If it does not, then your last resort is to examine exemplar traces one by one, the equivalent of sorting through individual pieces of hay until you find your needle.
You deserve better
To be clear — exemplars CAN be useful, and they are certainly an improvement over the siloed state of metrics and distributed traces that many organizations live with today. At Chronosphere, we believe we CAN and MUST do better so that our users get the full value of distributed tracing.
For users to realize the potential of metrics and distributed traces combined, they need the freedom to jump from looking at metrics on a dashboard to analyzing all of their distributed traces for interesting patterns across the relevant dimensions. With Chronosphere, we not only make this possible, but also make answering the types of questions posed in our example an automatic process rather than a prescriptive workflow. Chronosphere users spend less time poring over individual traces for suspicious behavior and more time addressing the problem they came to triage and understand.