Bringing fair database queries to our customers

On February 3rd, 2022

Problem

Chronosphere customers query their metric data using dashboards, automated alerts, and ad hoc exploration queries. Query results can range from a few data points to billions, and a query returning billions of data points consumes far more compute resources than a small one. Since customers purchase a finite amount of database resources, large queries can starve small queries of those resources. A typical example we see is a customer's critical alert queries being starved by a user running a large ad hoc exploration query. Our customers wanted a way to share their purchased resources fairly across all users.

Rate of alert queries/sec. An expensive ad hoc query runs at 20:00, temporarily starving alert queries from running.

Background

Before detailing the solution, it's helpful to understand how Chronosphere processes queries at a high level. A query consists of a filter (e.g., http_errors{service="app-server"}) and a time range (e.g., 2pm – 5pm). Chronosphere persists customers' metric data in time window data blocks (e.g., every 2 hours) and indexes the data into separate time window index blocks. The index blocks map label values to records in the data blocks, so the data records matching a query can be found efficiently.

A database query consists of index lookup and data retrieval.

The resource usage profile of a query is partitioned into two steps.

  1. Index Lookup – Traversing the index blocks looking for matches is CPU intensive, since most blocks are already held in memory.
  2. Data Retrieval – Retrieving data blocks is memory intensive, since new data blocks may need to be paged from disk into memory.
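The two phases above can be sketched as a pair of functions. This is a deliberately simplified illustration, not Chronosphere's actual data model; the types, field names, and in-memory maps are all hypothetical stand-ins for the real index and data blocks.

```go
package main

import "fmt"

// Query is a hypothetical, simplified query: label matchers plus a time range.
type Query struct {
	Filter     map[string]string // label matchers, e.g. service="app-server"
	Start, End int64             // time range (unix seconds)
}

// indexLookup scans index blocks for record IDs matching the filter.
// This phase is CPU-bound: the index blocks are already held in memory.
func indexLookup(q Query, index map[string][]int) []int {
	var ids []int
	for _, matcher := range q.Filter {
		ids = append(ids, index[matcher]...)
	}
	return ids
}

// dataRetrieval fetches the matching records from the data blocks.
// This phase is memory-bound: cold blocks may need paging in from disk.
func dataRetrieval(ids []int, data map[int]float64) []float64 {
	out := make([]float64, 0, len(ids))
	for _, id := range ids {
		out = append(out, data[id])
	}
	return out
}

func main() {
	index := map[string][]int{"app-server": {1, 2}}
	data := map[int]float64{1: 10, 2: 20}
	q := Query{Filter: map[string]string{"service": "app-server"}}
	fmt.Println(dataRetrieval(indexLookup(q, index), data)) // [10 20]
}
```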

Historically nothing prevented a single user from consuming all the cores and memory, temporarily starving other users from running queries.

Goal

The goal was to fairly share compute resources when the system is under high concurrent load and allow a single user to consume all resources if the system is idle. In other words, we did not want to throttle users artificially.

Index scheduling

Large queries dominate

There are a fixed number of goroutine workers reserved for index processing. Typically there are more queries to process than workers available, resulting in queries blocked waiting for a worker. Once a query is scheduled on a worker, it runs until it’s done processing the index. 

The entire index is searched when holding a worker.

This can result in large queries delaying small queries from running.

Large queries block small queries from running.

We introduced an iterator pattern to allow a worker to process a chunk of the index at a time and then yield to another query waiting. When the query is scheduled again, it resumes scanning the next chunk of the index.

The iterator yields the worker after processing an index chunk

Large queries are chunked up into smaller units of work, allowing smaller queries to be scheduled sooner.  

Small queries now interleaved with larger queries
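The iterator pattern described above can be sketched roughly as follows. The type, chunk size, and string-slice index representation are illustrative assumptions, not the real implementation; the point is that each call to Next does a bounded amount of work and then returns control so the worker can be handed to another waiting query.

```go
package main

import "fmt"

// indexIterator scans the index one chunk at a time so the worker can
// yield between chunks. The representation here is hypothetical.
type indexIterator struct {
	index     []string // the full index, simplified to a slice of entries
	pos       int
	chunkSize int
}

// Next processes up to chunkSize entries and returns the matches found,
// plus whether more of the index remains. After each call the scheduler
// can hand the worker to another waiting query, then resume this one.
func (it *indexIterator) Next(match func(string) bool) (hits []string, more bool) {
	end := it.pos + it.chunkSize
	if end > len(it.index) {
		end = len(it.index)
	}
	for _, entry := range it.index[it.pos:end] {
		if match(entry) {
			hits = append(hits, entry)
		}
	}
	it.pos = end
	return hits, it.pos < len(it.index)
}

func main() {
	it := &indexIterator{index: []string{"a", "ab", "b", "ba"}, chunkSize: 2}
	var all []string
	for {
		hits, more := it.Next(func(s string) bool { return s[0] == 'a' })
		all = append(all, hits...)
		if !more {
			break
		}
		// Here the worker would be yielded back to the scheduler.
	}
	fmt.Println(all) // [a ab]
}
```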

Many user queries

Large queries are not the only way to block other queries. A user can execute many queries in parallel from a dashboard and monopolize all cores at once, locking out all other users. 

A user can block other users with many parallel queries.

We needed to replace the first in, first out (FIFO) scheduler with a fair scheduler. The first attempt was to round-robin among all users waiting in the queue. In theory, user one would no longer block user two from running.

Round Robin in theory with equal slices.

Surprisingly, round-robin did not solve the problem. Large queries could still delay small queries from running. The issue was that each query ran for an equal time slice whether it needed the full slice or not. The time slice was on the order of milliseconds, chosen to avoid goroutine scheduling thrash and to let small, frequent queries complete within a single slice.

Round Robin in practice with unequal slices.

Rereading the Round Robin literature exposed this key insight:

It may not be desirable if the size of the data packets varies widely from one job to another. A user that produces large packets is favored over other users. In that case, fair queuing would be preferable.

The fair queuing implementation maintains a heap of waiting users sorted by least index time used. The user with the least index time used is scheduled next.

Fair queueing gives equal index time to all users.

The main downside of fair queuing is the log(n) complexity to insert and remove waiting users from the priority queue. More efficient scheduling algorithms, like deficit round robin, sacrifice some fairness for efficiency. Ultimately, the number of waiting users is typically quite small, so efficiency gains would be negligible. Our actual implementation maintains a small sorted list for simplicity.

Data scheduling

A similar problem happened when retrieving data records from disk. Record retrieval was also first come, first served, allowing a single user to consume all the memory allocated for paging in data records from disk. The solution was similar: we added an iterator for requesting records from disk and reused the same fair scheduling implementation, this time tracking bytes read from disk instead of index time.
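Swapping the fairness metric from index time to bytes read only changes the accounting, which might look something like this sketch (the byteAccountant type and its methods are hypothetical names, not the real implementation):

```go
package main

import "fmt"

// byteAccountant charges each user for bytes paged in from disk, so the
// data scheduler can pick the waiting user with the least bytes read.
type byteAccountant struct {
	used map[string]int64 // cumulative bytes read per user
}

// Charge records n bytes read from disk on behalf of user.
func (a *byteAccountant) Charge(user string, n int64) {
	if a.used == nil {
		a.used = map[string]int64{}
	}
	a.used[user] += n
}

// Next returns the waiting user with the fewest bytes read so far,
// mirroring the least-index-time rule used for index scheduling.
func (a *byteAccountant) Next(waiting []string) string {
	best := waiting[0]
	for _, u := range waiting[1:] {
		if a.used[u] < a.used[best] {
			best = u
		}
	}
	return best
}

func main() {
	var acct byteAccountant
	acct.Charge("user1", 4096)
	acct.Charge("user2", 512)
	fmt.Println(acct.Next([]string{"user1", "user2"})) // user2
}
```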

Memory issues

Unfortunately, pausing large queries resulted in higher memory usage overall. A query requires holding the entire RPC response in memory until it completes. Pausing queries meant more (and typically larger) responses were held in memory while waiting for disk access to complete.

Responses buffered in memory waiting to read from disk.

Introducing an RPC stream allowed the server to flush data records to the client after reading from disk, removing the need to buffer the records on the server.

Data records are streamed to the client, removing memory pressure on the server.
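The shape of that change can be sketched with a minimal stream abstraction. This is not the actual RPC layer (which would be something like a gRPC server-streaming method); the recordStream interface and serveQuery function are illustrative assumptions showing how flushing per block avoids buffering the whole response.

```go
package main

import "fmt"

// recordStream abstracts a server-side RPC stream. Hypothetical
// interface standing in for e.g. a gRPC server-streaming method.
type recordStream interface {
	Send(records []float64) error
}

// printStream is a toy stream implementation for demonstration.
type printStream struct{ sent int }

func (s *printStream) Send(records []float64) error {
	s.sent += len(records)
	fmt.Println("flushed", len(records), "records")
	return nil
}

// serveQuery reads data blocks one at a time and flushes each batch to
// the client immediately, so the server never holds the full response.
func serveQuery(blocks [][]float64, stream recordStream) error {
	for _, b := range blocks {
		if err := stream.Send(b); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &printStream{}
	_ = serveQuery([][]float64{{1, 2}, {3}}, s)
	fmt.Println("total:", s.sent) // total: 3
}
```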

Results

The new scheduler prevents heavy ad hoc users of the system from disrupting critical automated queries. The graphs below show how we monitor our customers' index usage to ensure alert and recording rules are not disrupted by dashboards. A spike in dashboard traffic results in a temporary delay for some dashboard users, while alert and recording rules are unaffected.

Total CPU time used by index workers on a single database instance. A spike in dashboard usage does not disrupt critical alert and recording rules. 
Total wait time for an index worker on a single database instance. Dashboards are temporarily delayed since the resource quota has been exhausted.

This project also enabled us to launch key insights for our customers. Chronosphere customers can now monitor which users are consuming system resources. This makes it easier to find problematic dashboards and rules impacting other system users.

Customers can see their most expensive recording rules.
Customers can see their most expensive alerting rules.
Customers can see their most expensive dashboards.

Interested in what we are building?