X-Ray: Making derived telemetry transparent

TL;DR

Metric aggregation, using things like Recording Rules, is great for performance and makes queries easier to work with. But it comes with an annoying, hard-to-solve side effect: once a metric is simplified, it’s hard to figure out what data it actually came from.

At Chronosphere, we’ve lived with that side effect just like everyone else with a Prometheus-compatible metric backend. That is, until now. I wanted to take some time to talk about X-Ray: a tool that lets you peel back layers of derived telemetry and get to the underlying queries.

Life before X-Ray

At Chronosphere, we’ve been living with a long-standing annoyance: once a metric is abstracted behind another simplified metric, using something like a Recording Rule (RR) or a Derived Metric (DM), it’s incredibly hard to connect the two and figure out what the underlying metric data actually is. This was never ideal, but we tolerated it. It was just one of those little annoyances that flew under the radar. We just didn’t think about it.

The situation went from slightly annoying to downright frustrating when we built our internal SLO system. Even though it was a system we understood and controlled, it still required a lot of time and work to use the chronoctl CLI commands and manually scan YAML to get to the information we needed. The pain was real, but localized. So we continued to live with it.

Once we started building our customer-facing SLO feature, as well as Differential Diagnosis (DDx) for Metrics, we realized we were about to export the same problem to our customers and put a spotlight on it. Our customer-facing SLO feature heavily utilizes DMs and RRs under the hood. DDx for Metrics requires visibility into metric dimensions to conduct correlation analysis and highlight what changed and what didn’t. Metrics produced by RRs and DMs often have a limited set of dimensions, making DDx for Metrics far less effective at surfacing anomalies. We were really excited about both of these features, but they just didn’t work together. Not ideal.

What’s needed

We needed a way to look at a query, recognize when it is built on layers of derived telemetry, and peel back those layers. It had to not only see through those layers but also rewrite the query, substituting the lower-level equivalents that likely carry more dimensions (labels and values).

To do this, my team developed X-Ray: a tool that enables you to take any query using derived telemetry produced by DMs or RRs and follow it back to its original source. Here is how we did it.

The Problem: Derived telemetry is a black box

Here’s a scenario our engineers know well:

An SLO burn rate alert (defined by a recording rule) fires. As the on-call engineer, you open Metrics Explorer, but find nothing but the abstracted metric name with very few dimensions. It looks something like this:

Figure 1. Chronosphere Metrics Explorer

Very unhelpful. You try DDx, but that doesn’t really give any new information, because the original cardinality was aggregated away. At this point, your only recourse is to:

Guess that the metric was generated by a recording rule.
Dump all RR definitions via chronoctl recording-rules list > rr.yaml.
Manually look through YAML to find the match.
Copy the PromQL expression back into a query window.

The whole thing can take 15–30 minutes, and that’s if you already know your way around the system and the recording rule is simple. RRs and DMs can be much harder to reason about if they are more complex:

RRs and DMs can be defined in terms of other RRs and DMs. This means you have to go through the above process again to find the real root metrics.
DMs have “variable selectors”, which can be provided by the query or use a default. For example, if the DM maps real_metric{$env} (with the variable $env defaulting to env="prod") to dm:metric, then a query to dm:metric should be replaced with real_metric{env="prod"} and a query to dm:metric{env="test"} should be replaced with real_metric{env="test"}. When you are trying to manually recreate a query, you need to keep these in mind and make sure you are using the correct value.
RRs can reuse a metric name. This means just searching the chronoctl output for the metric name is not enough. Each RR adds a hardcoded set of labels to each datapoint it records, so you have to look at these labels for each and every RR manually to find the exact one that matches your current case.
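
To make the variable selector case concrete, here is a minimal Python sketch of the substitution described above. It is not our query engine’s logic: the names DM_TEMPLATE, DM_DEFAULTS, and resolve_dm are hypothetical, and it assumes each variable is named after the label it stands for.

```python
import re
from typing import Dict

# Hypothetical DM definition from the example in the text:
# dm:metric maps to real_metric{$env}, and $env defaults to env="prod".
DM_TEMPLATE = "real_metric{$env}"
DM_DEFAULTS = {"env": 'env="prod"'}


def resolve_dm(template: str, defaults: Dict[str, str], selectors: Dict[str, str]) -> str:
    """Replace each $variable with the query's selector, else the DM's default."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in selectors:
            # The query supplied a value for this variable.
            return f'{name}="{selectors[name]}"'
        # Fall back to the default declared on the DM.
        return defaults[name]

    return re.sub(r"\$(\w+)", substitute, template)


print(resolve_dm(DM_TEMPLATE, DM_DEFAULTS, {}))               # real_metric{env="prod"}
print(resolve_dm(DM_TEMPLATE, DM_DEFAULTS, {"env": "test"}))  # real_metric{env="test"}
```

Doing this by hand for every layer of a nested DM is exactly where a wrong value can slip in.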

With all this complexity, you wind up wasting a lot of precious time just to understand what the metric even means, let alone what caused the alert. This is inefficient, error-prone, and just plain frustrating.

Technical challenges and how we solved them

When I first started prototyping this feature, it seemed like it would be relatively straightforward. After all, we have all the RR and DM definitions in our DB already. All we have to do is just, like, undo it. But, as with most engineering problems, it is far more complicated than it seems on the surface.

One challenge we came across was that DMs and RRs can be nested arbitrarily. This usage pattern was far more common than I would have thought, so it was the first problem we had to solve to make the feature usable. So given a Prom query, we’d have to:

Parse it
Get all distinct metric names
Check if each name was a DM or RR
Get the Prom query for each of those names
Recurse

Laid out like that, it became clear this was just a tree-traversal problem. That’s no biggie! The prototype used a recursive depth-first traversal to build a derived telemetry tree data structure. Problem Solved 🙂…
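
The steps above can be sketched as a small recursive traversal. This is an illustration rather than the prototype’s code: Node, TREE_DEFINITIONS, build_tree, and render are hypothetical names, the metric names are made up, and the “parse it” step is skipped by storing each definition alongside the metric names it references.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass
class Node:
    name: str
    expr: Optional[str] = None  # None means a raw (non-derived) metric
    children: List["Node"] = field(default_factory=list)


# Hypothetical store: derived metric name -> (PromQL, metric names it references).
# A real implementation would obtain the names by parsing the PromQL.
TREE_DEFINITIONS: Dict[str, Tuple[str, List[str]]] = {
    "slo:error_ratio": ("slo:errors / slo:total", ["slo:errors", "slo:total"]),
    "slo:errors": ('sum(rate(http_requests_total{code=~"5.."}[1h]))', ["http_requests_total"]),
    "slo:total": ("sum(rate(http_requests_total[1h]))", ["http_requests_total"]),
}


def build_tree(
    name: str,
    definitions: Dict[str, Tuple[str, List[str]]],
    seen: FrozenSet[str] = frozenset(),
) -> Node:
    """Depth-first expansion of a metric into its derived telemetry tree."""
    # Stop at raw metrics, and at names already on this path (cycle guard).
    if name not in definitions or name in seen:
        return Node(name)
    expr, refs = definitions[name]
    children = [build_tree(ref, definitions, seen | {name}) for ref in refs]
    return Node(name, expr, children)


def render(node: Node, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines, one per node."""
    label = f"{node.name} = {node.expr}" if node.expr else node.name
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(build_tree("slo:error_ratio", TREE_DEFINITIONS))))
```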

Except, the prototype UI’s tree-like display was useful to me, because I like seeing all the layers at once, but only a few of the folks we showed it to were able to interpret it.

Figure 2. Example of the nested query tree UI. In some cases one query can be made of derived telemetry that references other derived telemetry.

This leads to the next challenge: How do we make a usable UX for this feature? The engineering team loved seeing all this data up front, but we are too close to the system. We had to accept that this just wouldn’t be useful for actual users.

So what would be useful? Maybe we could show the data “one layer at a time” to prevent information overload. But what does that even mean? Do we pull the child expressions into the root, and fully replace all derived telemetry in the current query with their original expressions? Or do we only explore one branch at a time, and allow you to focus on only one of the derived telemetries?

After a lot of back and forth between the engineering team and our product manager, we ultimately landed on showing users all of the child expressions of the current query, and allowing them to follow only one of those children at a time. This keeps information overload to a minimum while still allowing the user to peel back as many layers as they need.

Finding the best approach

This decision made us realize that a depth-first tree traversal was not the correct approach for what we needed to accomplish, and so we scrapped the tree traversal and replaced it with an API that simply returns all child expressions of the given expression. If the user wants to peel back more layers, we would submit the child expression to the API again to get the next set of children. Problem Solved 🙂…
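
As a rough sketch of that “one layer at a time” shape (not the actual API), the function below returns only the direct children of an expression, and the caller feeds a chosen child back in to go one layer deeper. LAYER_DEFINITIONS and child_expressions are hypothetical names, and the whole-token regex match is a stand-in for real PromQL parsing, so it would be fooled by a metric name inside a label value.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical store: derived metric name -> the PromQL that produces it.
LAYER_DEFINITIONS: Dict[str, str] = {
    "slo:error_ratio": "slo:errors / slo:total",
    "slo:errors": 'sum(rate(http_requests_total{code=~"5.."}[1h]))',
    "slo:total": "sum(rate(http_requests_total[1h]))",
}


def child_expressions(expr: str, definitions: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (name, expression) for each derived metric referenced directly by expr."""
    children = []
    for name, child_expr in definitions.items():
        # Match the name only as a whole token, not as part of a longer name.
        if re.search(r"(?<![\w:])" + re.escape(name) + r"(?![\w:])", expr):
            children.append((name, child_expr))
    return children


# Peel back one layer, then follow a single child to the next layer.
layer1 = child_expressions("slo:error_ratio > 0.01", LAYER_DEFINITIONS)
layer2 = child_expressions(layer1[0][1], LAYER_DEFINITIONS)
print(layer1)
print(layer2)
```

An empty result means the expression is already built on raw metrics, so there is nothing left to peel back.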

Except, the prototype isn’t handling two of those complexities I mentioned earlier: DM variable selectors and duplicate RR names. Results can be downright wrong if these issues aren’t handled. Luckily, we already handle resolving DM variable selectors in our query engine. We just needed a refactor to make that logic reusable, and that problem is solved.

Duplicate RR names are a trickier problem to handle. It turns out that there is no way to uniquely identify a specific RR definition from a Prom expression. The closest we can get is using the metric name plus the recording rule labels. Because of this, we can get into a situation where there are multiple RR definitions that are applicable to a given Prom expression, so how do we resolve this? Well, we have to approximate. If we make the assumption that two Prom expressions, A and B, each produce a distinct set of series from each other, then a combined expression that unions the two (A or B in PromQL) should produce results equivalent to Result(A) ∪ Result(B). This assumption is not necessarily true for RRs, but it’s a reasonable assumption for the use cases we see. Problem Solved 🙂…
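
To see why the distinct-series assumption matters, here is a small Python model of the union. It reads the combination as PromQL’s or set operator, which is an assumption on my part for this sketch; the sample values are made up, and labels and prom_or are hypothetical helpers.

```python
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Series = Dict[Labels, float]


def labels(**kv: str) -> Labels:
    """Build a hashable label set from keyword arguments."""
    return frozenset(kv.items())


def prom_or(a: Series, b: Series) -> Series:
    """Model `a or b`: every series in a, plus series in b whose label set is not in a."""
    out = dict(a)
    for key, value in b.items():
        out.setdefault(key, value)
    return out


result_a = {labels(service="chrono-metrics", slo="availability"): 0.01}
result_b = {labels(service="derivedmetric", slo="availability"): 0.03}

# Disjoint label sets: the combination is exactly Result(A) ∪ Result(B).
combined = prom_or(result_a, result_b)
print(len(combined))

# Overlapping label sets: the series from the right-hand side is silently dropped,
# which is where the approximation stops being exact.
overlapping = {labels(service="chrono-metrics", slo="availability"): 0.5}
print(prom_or(result_a, overlapping) == result_a)
```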

Except, the prototype just loads every DM and RR into memory and does a linear search by metric name every single time. That just won’t do for production. Some of our customers have hundreds of thousands of RRs and DMs. Doing a linear, in-memory search on this much data is a bit slow, and it gets slower the more metric names in the query that we need to look up. Now we need to make querying DMs and RRs efficient.

Once again, this is easy for DMs. Metric names are globally unique for DMs, so we just need to make a simple unique index in our DB. Once again, RRs are more difficult due to their lack of uniqueness. Relational Databases 101 says for such a use case, it would be best to use a simple intermediate table like this:

rr_id	metric_name	label_name	label_value
1	slo:sli_error:ratio_rate1h	service	chrono-metrics
1	slo:sli_error:ratio_rate1h	slo	availability
2	slo:sli_error:ratio_rate1h	service	derivedmetric
2	slo:sli_error:ratio_rate1h	slo	availability

We would then construct a simple SQL query based on the given metric and selectors. If the given Prom expression was rr_metric{service="chrono-metrics", slo="availability"}, we could get applicable RRs using a query along these lines:

SELECT rr_id
FROM rr_labels 
WHERE metric_name = 'slo:sli_error:ratio_rate1h' AND ( 
 (label_name = 'service' AND label_value = 'chrono-metrics') OR  
 (label_name = 'slo' AND label_value = 'availability')
 )
GROUP BY rr_id  
HAVING count(DISTINCT label_name, label_value) = 2

Except, there is an important caveat here that this simple query can’t handle: we don’t know which selectors in the Prom query correspond to “real” labels and which correspond to RR labels. This SQL query will fail if any of the selectors are not RR labels (e.g. rr_metric{service="chrono-metrics", slo="availability", pod="abc-123"}) OR if we fail to provide any of the RR labels as selectors in the Prom expression (e.g. rr_metric{service="chrono-metrics"}). This SQL query needs to be more lax, and thus more complicated. We can’t say for sure whether a set of selectors definitely matches an RR, but we can disqualify RRs that definitely do not match. So we came up with this:

SELECT DISTINCT rr_id
FROM rr_labels
WHERE metric_name = 'slo:sli_error:ratio_rate1h'
  AND rr_id NOT IN (
    -- This subquery finds all RRs that are explicitly disqualified from matching our input:
    -- RRs that carry one of the selector labels with a different value.
    SELECT DISTINCT e2.rr_id
    FROM rr_labels e2
    WHERE e2.metric_name = 'slo:sli_error:ratio_rate1h' AND (
      (e2.label_name = 'service' AND e2.label_value <> 'chrono-metrics') OR
      (e2.label_name = 'slo' AND e2.label_value <> 'availability')
    )
  )
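
Here is a quick way to try the disqualification query against the sample table from earlier. This sketch uses Python’s built-in sqlite3 with an in-memory database instead of MySQL, purely so it runs anywhere; candidate_rrs and the way the query is assembled from selectors are illustrative assumptions, not our production code.

```python
import sqlite3
from typing import Dict, List

METRIC = "slo:sli_error:ratio_rate1h"
ROWS = [
    (1, METRIC, "service", "chrono-metrics"),
    (1, METRIC, "slo", "availability"),
    (2, METRIC, "service", "derivedmetric"),
    (2, METRIC, "slo", "availability"),
]


def candidate_rrs(conn: sqlite3.Connection, metric: str, selectors: Dict[str, str]) -> List[int]:
    """Return ids of RRs that no equality selector rules out."""
    sql = "SELECT DISTINCT rr_id FROM rr_labels WHERE metric_name = ?"
    params: list = [metric]
    if selectors:
        # One disqualifying condition per selector: same label, different value.
        conflicts = " OR ".join(
            "(e2.label_name = ? AND e2.label_value <> ?)" for _ in selectors
        )
        sql += (
            " AND rr_id NOT IN ("
            "SELECT e2.rr_id FROM rr_labels e2 "
            f"WHERE e2.metric_name = ? AND ({conflicts}))"
        )
        params.append(metric)
        for name, value in selectors.items():
            params += [name, value]
    sql += " ORDER BY rr_id"
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rr_labels (rr_id INTEGER, metric_name TEXT, label_name TEXT, label_value TEXT)"
)
conn.executemany("INSERT INTO rr_labels VALUES (?, ?, ?, ?)", ROWS)

# An extra non-RR selector (pod) no longer breaks the lookup.
print(candidate_rrs(conn, METRIC, {"service": "chrono-metrics", "slo": "availability", "pod": "abc-123"}))
# Omitting an RR label keeps every RR that is still possible.
print(candidate_rrs(conn, METRIC, {"slo": "availability"}))
```

Both cases that broke the first query now return sensible candidates: the extra pod selector is ignored because no RR carries that label, and a missing service selector simply leaves both RRs in play.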

Except, there is another, more frustrating, caveat that this query cannot handle. Some selectors use regexp (like service=~"(chrono-metrics|derivedmetric)"). You would think this would be as simple as making the above query look like (label_name = 'service' AND label_value NOT REGEXP '(chrono-metrics|derivedmetric)'), but no. We have a regexp pattern that comes up a lot in Prom queries on our platform that looks like (|option1|option2), but this is not a valid regexp in MySQL 5.7. It works in MySQL 8, which our infra team is hard at work migrating to, but for now, we need a workaround. That workaround is to pull the disqualification logic out of SQL and into code. It’s certainly slower and more verbose this way, but it will get the job done until we are fully migrated.
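
A minimal sketch of what disqualification in code can look like, under some loud assumptions: it is written in Python rather than our service’s language, it only models the = and =~ matchers, and it uses Python’s re module, whose syntax differs in places from the RE2 syntax Prometheus uses. RR_LABELS, is_disqualified, and candidates are hypothetical names, and RR 3 is an invented extra row.

```python
import re
from typing import Dict, List, Tuple

# A selector is (operator, value); only "=" and "=~" are modeled here.
Selector = Tuple[str, str]

# Hypothetical in-memory view of RR labels, keyed by RR id.
RR_LABELS: Dict[int, Dict[str, str]] = {
    1: {"service": "chrono-metrics", "slo": "availability"},
    2: {"service": "derivedmetric", "slo": "availability"},
    3: {"service": "billing", "slo": "availability"},
}


def is_disqualified(rr_labels: Dict[str, str], selectors: Dict[str, Selector]) -> bool:
    """True if some selector definitely cannot match this RR's hardcoded labels."""
    for name, (op, value) in selectors.items():
        if name not in rr_labels:
            continue  # not an RR label, so it cannot rule this RR out
        actual = rr_labels[name]
        if op == "=" and actual != value:
            return True
        # PromQL regex matchers are fully anchored, hence fullmatch.
        if op == "=~" and re.fullmatch(value, actual) is None:
            return True
    return False


def candidates(rrs: Dict[int, Dict[str, str]], selectors: Dict[str, Selector]) -> List[int]:
    """Ids of RRs that survive disqualification, in ascending order."""
    return sorted(rr_id for rr_id, lbls in rrs.items() if not is_disqualified(lbls, selectors))


# The empty-alternative pattern that MySQL 5.7 rejects is fine here.
print(candidates(RR_LABELS, {"service": ("=~", "(|chrono-metrics|derivedmetric)")}))
```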

Finally, after all these challenges, it works! Now we “just” have to backfill this table and keep it up to date. That is a whole other can of worms that another one of our amazing engineers spent weeks optimizing, but I won’t go more into that here.

Real example: What X-Ray fixes

Before X-Ray:

An alert fired on slo:sli_error:ratio_rate1h.
DDx showed no details due to aggregated telemetry.
You had to dump YAML and manually search for the right RR.

With X-Ray:

Click on the metric.
See the exact PromQL used to generate it.
All selectors are merged automatically.
Drill into the upstream metric using DDx.

Figure 3: X-Ray peels back the layers of derived telemetry to get to the underlying data and its dimensions.

What’s next

It’s early days for X-Ray. We see a growing number of ways and places we can use this new capability across our platform. A few of the things we are looking at next are:

Multi-layer traversal: Optional full-tree expansion for power users.
Traceable audit logs: See how RRs/DMs evolve over time.
X-Ray everywhere: Run X-Ray anywhere you can read or write a Prom query.
