Comparing queries to validate Chronosphere releases

November 16, 2021

We recently published a post detailing how Chronosphere validates releases, what we learnt from the process, and what we needed to fill the gaps we identified. This post digs deeper into the tool we developed to fill those gaps, called “Query Compare,” or “querycmp” for short. Querycmp solves the problems outlined there by emulating a real environment under significant real-world write and query load, with the ability to spin up as a load-testing suite for custom builds.

TL;DR: querycmp runs a real-world suite of queries against both a real environment and a mirrored environment (an environment cloned from the real one, with matching components and resource capacities, and receiving the same write traffic), then compares results to make sure they match.

There are three phases for the typical lifecycle of a querycmp validation run:

  • Setup is the work required to bring up querycmp in a ready state.
  • Execution is the querycmp run itself, including all relevant instrumentation and reporting.
  • Investigation is an external phase to observe and act on comparison results.
A diagram showing the architecture of the query compare process

Setup

The setup for querycmp involves bringing up an environment provisioned with the same resources as the dogfooding meta environment and then dual-writing to both environments to ensure they contain the same set of data. Provisioning a mirrored environment makes heavy use of the M3DB Operator, to deploy an entire suite of M3 components, leveraging Kubernetes to create a fresh stack and check it’s healthy. After the stack is healthy, our LaunchDarkly integration notifies a service called “reflector”. This service receives forwarded write traffic from a real environment and proxies that traffic to any number of configured target environments.
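The fan-out behavior of the reflector service can be sketched in a few lines. This is an illustrative sketch only, not Chronosphere's actual implementation; the `Reflector` class and `targets` mapping are hypothetical names, and a real proxy would buffer and retry rather than swallow failures:

```python
class Reflector:
    """Forwards every incoming write batch to all configured target environments."""

    def __init__(self, targets):
        # targets: mapping of environment name -> callable that accepts a write batch
        self.targets = targets

    def forward(self, batch):
        # Proxy the same batch to every target; record per-target outcome so a
        # failing mirror doesn't block writes to the others.
        results = {}
        for name, send in self.targets.items():
            try:
                send(batch)
                results[name] = "ok"
            except Exception as exc:  # a real proxy would retry/queue here
                results[name] = f"error: {exc}"
        return results
```

The key property is that the real environment and any number of test environments receive identical write traffic, which is what makes later result comparison meaningful.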

At this point, tests wait for a specified amount of time before kicking off a querycmp run. This delay is necessary because there needs to be enough metrics written to the test environment, in terms of volume and time range, before we can reasonably expect all query results between meta and the test environment to match. After the wait period, it’s time to run querycmp.

Execution

As querycmp starts, it pulls a collated list of queries from a cloud blob store (a Google Cloud Storage bucket). The list consists of the scraped set of all alert and dashboard queries used in the meta environment and a set of user-defined queries for applications that our dashboards rarely use. For example, we don’t use many dashboards for predictions or forecasting, so functions like holt_winters() don’t see much use and are added to the list. As automated testing is a case of “the more, the merrier,” we plan to pull ad-hoc queries used by engineers when inspecting data and automatically deduplicate and populate them into the user-defined query list.

Now the comparison logic starts. Each query in the list is concurrently issued against the meta and test environments, using the same start/end/step arguments, and then the results are compared in a point-wise fashion to ensure they match. After the comparison, we increment a metric based on the result:

  • Matched, if both series are exactly equal. In this case, continue to the next query in the list without any further work.
  • Mismatched, if there are any differences between results. In this case, capture the mismatch diff and generate a file containing the mismatch summary, uploading it to a “Mismatch” cloud storage bucket that includes the ID for this run.
  • Errored, if either the meta environment or the tested environment returns an error for the query. For example, if a dashboard has a malformed query that fails before even fetching data. In this case, upload an error log to an “Error” storage bucket for the run ID.
  • Skipped, if there is uncertainty that the query has returned a full set of data. For cases where, for example, the output set of a query that matches every series, like {__name__!=""}, is artificially limited to a smaller set. Since this limiting doesn’t always happen in a deterministic fashion, we can’t confidently say that running it on meta and the tested environment would yield the same results, so instead we upload a summary file featuring the reason for skipping to a “Skipped” bucket.
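The four outcomes above can be sketched as a small classification function. This is a simplified illustration under assumptions: results are modeled as lists of (timestamp, value) pairs, and the `truncated` flag stands in for the real logic that detects artificially limited result sets:

```python
MATCHED, MISMATCHED, ERRORED, SKIPPED = "matched", "mismatched", "errored", "skipped"

def classify(meta_result, test_result, meta_err=None, test_err=None, truncated=False):
    """Classify one query comparison into the four outcome buckets."""
    if meta_err or test_err:
        # either environment failed the query, e.g. a malformed dashboard query
        return ERRORED
    if truncated:
        # e.g. a query like {__name__!=""} whose output was artificially limited,
        # so a full point-wise comparison can't be trusted
        return SKIPPED
    # point-wise comparison of (timestamp, value) pairs
    return MATCHED if meta_result == test_result else MISMATCHED
```

Each outcome would then increment the corresponding metric and, for mismatches, errors, and skips, upload a summary artifact keyed by the run ID.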
A diagram showing the label name comparison process

This comparison logic first runs every query as an instant query using the instant query endpoint. After these comparisons complete, we go through the query list again, this time executing each as a range query, with particular logic for selecting start and end times that yields an appropriate time range containing a complete set of data (read more in the deep dive section below). These comparisons emit similar metrics and summary files, marked to indicate that the query_range endpoint specifically generated them.

Once the query range comparisons have been completed, comparisons against metadata query endpoints start, which are queries for finding available label names and values. These comparisons run differently from the query endpoints: there is no list of “queries”; instead, we generate the queries to run from the results themselves:

  1. A label_names query runs on the tested and meta environments to get a list of all valid label names.
  2. Compare these lists and emit a “Matched” or “Mismatched” metric based on the result (with special care taken to exclude certain label names from series that may be self-emitted from the tested environment, as these would pollute mismatches).
  3. Take the intersection of the label name lists from the previous steps to create a list of all label names we expect to see in both environments.
  4. For each of these label names, perform a label_values query against both environments.
  5. Compare the results as in the label names comparison in step 2, but with “Matched” or “Mismatched” metrics and artifacts scoped specifically to label values for differentiation.
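Steps 1–3 can be sketched as a set comparison. This is an illustrative sketch: the `SELF_EMITTED` exclusion set is a made-up example of the special care taken to exclude labels that the tested environment emits about itself, and the function takes already-fetched name lists rather than issuing real `label_names` queries:

```python
# Example labels to exclude before comparing; real names would differ.
SELF_EMITTED = {"querycmp_run_id"}

def compare_label_names(meta_names, test_names):
    """Compare label-name sets from both environments, ignoring self-emitted labels.

    Returns the outcome plus the intersection, which drives the follow-up
    label_values comparisons (step 4).
    """
    meta = set(meta_names) - SELF_EMITTED
    test = set(test_names) - SELF_EMITTED
    outcome = "matched" if meta == test else "mismatched"
    return outcome, meta & test
```

Using the intersection (rather than either side's full list) for step 4 means the label_values comparisons only cover names both environments agree exist, so a single missing name produces one mismatch instead of a cascade.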

After completing all label_values comparisons, a single “run” ends, and further behavior depends on how the command was triggered. In most common test situations, another run begins from the execution phase. Each successive run extends the valid range allowable for comparisons: since metrics emission remains enabled, the time the first run started (i.e. the time at which there were enough valid metrics for comparisons) remains valid for all subsequent runs.

Investigation

As mentioned above, querycmp emits metrics to indicate progress. After all, M3 is a metrics store, and when you have a metrics store, you tend to want to deal with metrics. We compile results from mismatched runs and upload them to a cloud storage bucket with one-week retention for manual retrieval and inspection. We rely on querying the emitted metrics to calculate a running match rate to see how closely the new build tracks against the current stable version.
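One simple way to compute a running match rate from the emitted outcome counters looks like the following. This is a sketch under assumptions: counter names are illustrative, and it deliberately excludes skipped and errored queries from the denominator, since those say nothing about result parity (the real calculation may weight things differently):

```python
def match_rate(counters):
    """Fraction of compared queries that matched, or None if nothing was compared.

    counters: e.g. {"matched": 980, "mismatched": 5, "errored": 10, "skipped": 5}
    """
    matched = counters.get("matched", 0)
    compared = matched + counters.get("mismatched", 0)
    return matched / compared if compared else None
```

A new build tracking close to 1.0 against the current stable version is the signal we look for before promoting it.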

Assembling the testing process

Temporal scenario testing

Getting the testing process running has a few moving parts:

  • Setting up an environment
  • Deploying components
  • Waiting for components to become healthy
  • Forwarding writes
  • Waiting for those forwarded writes to become healthy
  • Kicking off runs

We monitor the metrics querycmp emits to ensure it runs as expected and doesn’t cause mismatches during the whole process. To coordinate this, we use Temporal, the same tool used for scenario testing, to compose these steps as distinct workflows. Temporal makes the querycmp cycle repeatable, automated, and perfect for including in scenario tests, which already rely on Temporal and run continuously against release candidates.
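The lifecycle above can be sketched as a plain sequential pipeline. To be clear, the real coordination uses Temporal workflows and activities with retries and timeouts; everything here, including the step names and the `validation_workflow` helper, is illustrative only:

```python
def validation_workflow(steps):
    """Run each named step in order; a step returns True once healthy/complete."""
    completed = []
    for name, step in steps:
        if not step():
            # Temporal would retry/escalate; this sketch just fails the run.
            raise RuntimeError(f"step failed: {name}")
        completed.append(name)
    return completed

# The steps from the list above, with trivial stand-in implementations.
steps = [
    ("set up environment", lambda: True),
    ("deploy components", lambda: True),
    ("wait for components to become healthy", lambda: True),
    ("forward writes", lambda: True),
    ("wait for forwarded writes to become healthy", lambda: True),
    ("kick off runs", lambda: True),
]
```

What Temporal adds over a plain script is durability: each step is checkpointed, so a crashed worker resumes mid-workflow instead of restarting the whole cycle.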

A continuous parallel verification check runs alongside the querycmp process, ensuring there are no mismatches and that there aren’t too many errors, which could indicate a broken build. If this check fails, we label the entire scenario run as failed and signal it as such, flagging it for manual review to find the root cause. Continuously running comparisons in this manner is useful for ensuring no read/write degradations occur when performing the complex actions modeled by scenario tests, such as adding a database node to an active cluster.

Ephemeral environments

On top of querycmp is CLI tooling (called “droidcli”) for starting the testing process by running something like the following command:

droidcli querycmp start --environment rc_01

This helped foster confidence in development builds, especially for long-lived branches and complicated changes that were previously painful to test, like the index active block refactor work.  

A tweak to the core flow gives the option to run querycmp as a testbed for query performance under load. In this mode, we turn off comparisons entirely and run only against the test environment. There are additional levers to tweak, for example the concurrency at which to send queries, to simulate periods of extreme load. Running in this mode has helped expose and reproduce issues with query degradation, and cases where some queries arbitrarily run faster than others, which are otherwise difficult to simulate. Unfortunately, it’s not realistic to keep comparison verifications running in this mode, as it would thrash both the test environment and our meta environment.
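The concurrency lever can be sketched with a thread pool. This is an illustrative sketch, not the real tool: `issue_query` is a hypothetical stand-in for sending one query to the test environment, and the pool size plays the role of the concurrency setting:

```python
from concurrent.futures import ThreadPoolExecutor

def run_load(queries, issue_query, concurrency=8):
    """Issue queries against the test environment with bounded concurrency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map preserves input order even though execution is concurrent
        return list(pool.map(issue_query, queries))
```

Cranking `concurrency` up simulates a burst of simultaneous dashboard loads; comparing latency distributions across settings is what surfaces degradation and unfair scheduling between queries.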

Deep dive into the testing process

Acting on results

How long do we wait until there’s enough data to compare results confidently?

Familiarity with Prometheus staleness mechanics helps answer this.

In summary, a point in a given series is valid for 5 minutes after it’s emitted, or until a newer write for that series arrives within the 5-minute window. Any query, even a simple lookup like foo{bar="baz"}, needs at least 5 minutes of data to account for any sparse series emitted just under 5 minutes before the query start.

Any queries that act on range vectors, e.g. sum_over_time(foo{bar="baz"}[10m]), don’t follow these staleness semantics and instead need data covering the entire range of the matrix; in the example above, that’s 10 minutes before the query start. Subqueries complicate this further by requiring an extra range of data on top to service the underlying query.

To account for this complication, we delay for 5 minutes after write forwarding is ready, which is enough data to service the majority of instant queries, plus all queries whose range selectors (over-time functions) cover less than 5 minutes. For the rest, we use the Prometheus parser’s Inspect function, which walks a query parsed into a traversable tree, allowing calculation of the maximum time range the query requires. Queries with range selectors exceeding this standard wait are skipped until enough time has passed since startTime; later runs pick them up as they become valid. This is fine in practice, as the 5-minute window services over 90% of queries.
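The rule reduces to a small calculation: every query needs at least the 5-minute staleness lookback, plus the widest range selector it contains. The sketch below assumes the selector ranges have already been extracted from the parsed query (the real implementation walks the PromQL tree via the parser's Inspect function; this code does not parse anything):

```python
STALENESS_SECONDS = 5 * 60  # Prometheus staleness lookback

def required_data_seconds(range_selectors_seconds):
    """Seconds of data a query needs, given the ranges of its selectors.

    An instant lookup has no selectors and needs only the staleness window;
    sum_over_time(...[10m]) needs the full 10 minutes.
    """
    widest = max(range_selectors_seconds, default=0)
    return max(STALENESS_SECONDS, widest)

def is_query_valid(start_time, now, range_selectors_seconds):
    """True once enough time has passed since forwarding began (start_time)."""
    return now - start_time >= required_data_seconds(range_selectors_seconds)
```

Queries that fail this check in one run are simply retried in later runs, once the environment has accumulated enough history.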

Liveness delay

Since we mirror writes to the tested environment from meta, we can’t be sure that any query with a “now” endTime has all points available in both environments, due to non-deterministic propagation delays. These can arise from network degradation between m3reflector and the test environment, heightened datapoint ingestion delay under heavy load, or any number of other distributed-systems issues. Because of this, we introduce an artificial delay on query endTimes to allow enough time for all queried datapoints to be present in both environments. Choosing an appropriate period took trial and error. Too short a delay results in mismatches that are hard to diagnose, since inspecting them even a few seconds after the reported mismatch shows data that does match. Too long a delay pushes out the launch of querycmp: we must wait 5 minutes from startTime to account for staleness, plus the selected liveness delay before endTime, to be sure of a valid comparison window, which in turn increases the duration of a testing cycle without providing any additional verification. Ninety seconds turned out to be a fine balance between the two concerns.
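Putting the staleness wait and the liveness delay together gives the valid comparison window. A minimal sketch, with times as Unix seconds and the two constants taken from the values discussed above:

```python
STALENESS_DELAY = 5 * 60  # wait after write forwarding is ready (staleness)
LIVENESS_DELAY = 90       # back endTime off "now" (propagation delay)

def comparison_window(forwarding_started_at, now):
    """The (start, end) window in which comparisons are trustworthy.

    Returns None while not enough data has accumulated for any valid window.
    """
    start = forwarding_started_at + STALENESS_DELAY
    end = now - LIVENESS_DELAY
    if end <= start:
        return None
    return start, end
```

Every query comparison is clamped to this window; anything touching data before `start` or after `end` risks a false mismatch.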

Index and data block split

Running querycmp as validation for new builds, we saw persistent failures in label name and value comparisons. These mismatches were difficult to diagnose because the data showed that these labels didn’t exist in the query period on the tested environment, which is expected for a mismatch, but they were also missing from the same range in the meta environment. After a lot of detective work, we discovered the reason: the architecture of M3’s index and data blocks.

The following summary is a simplified version of the actual issue; for more information on index and data blocks, read the M3 architecture documentation.

At a high level, M3 operates in terms of index and data blocks. Index blocks contain a list of label name/value pairs emitted while that block was active, with an offset pointing into the appropriate data block for the same period; the data block contains the compressed timeseries data for each series. This means a series can “exist” in the index block for a given query time range yet return no results, as the actual datapoints may lie outside the queried range. This caused issues in querycmp because label lookup queries are handled exclusively by index lookups and never touch data. Some sparsely written series had labels added during the “spin-up” period of the tested environment, when those writes had not yet been forwarded. That period fell outside the valid query range, so it wasn’t included in any query comparisons, yet those labels still appeared in label lookups, since they lived in the same index block. To get around this, for any mismatched label, for example label_foo, that doesn’t exist in the tested environment, we run a data query count({label_foo!=""}) against meta to ensure the data does exist and was forwarded.
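That follow-up check can be sketched as a small filter over mismatched labels. This is an illustrative sketch: `query_meta` is a hypothetical stand-in for issuing a PromQL instant query against meta and returning the scalar count (or None on error):

```python
def is_real_mismatch(label, query_meta):
    """Decide whether a label missing from the test environment is a true mismatch.

    If count({label!=""}) on meta is positive, data for that label genuinely
    exists in the compared range and was forwarded, so the mismatch is real.
    Otherwise the label likely only lives in an index block whose datapoints
    fall outside the valid query range, and the mismatch is a false positive.
    """
    count = query_meta(f'count({{{label}!=""}})')
    return count is not None and count > 0
```

Filtering mismatches through this check removed the persistent false positives caused by the index/data block split without masking genuine divergence.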

Next steps

This post covered a deep dive into the query comparator tool and testing in general at Chronosphere, continuing from part 1 of this series. Future posts around these topics will look more closely at how Chronosphere uses Temporal and more detail on the active index block changes touched upon briefly in this post.

Interested in what we are building?