Learn how Chronosphere Histograms work with our Control Plane, enabling customers to manage, transform, aggregate, and reduce the costs of histogram metrics at scale as efficiently as any other metric.
Scott Kelly is a Sr. Product Marketing Manager at Chronosphere. Previously, he worked at VMware on the Tanzu Observability (Wavefront) team and led partner go-to-market strategies for VMware’s Tanzu portfolio with AWS and Microsoft Azure. Prior to VMware, Scott spent three years in product marketing at Dynatrace. Outside of work, Scott enjoys CrossFit, tackling home improvement projects, and spending time with his family in Naples, FL.
Victor Soares is the Metrics Platform Product Manager at Chronosphere. Victor spent the first part of his career as a software developer and technical account manager at various companies. He started his product management career in observability at New Relic, where he discovered his passion for helping developers and operators troubleshoot and understand their systems. Victor lives and works in Portland, Oregon, where he enjoys cycling, hiking, soccer, basketball, and adventuring with his family in the Pacific Northwest.
On: Aug 6, 2024
Today, we are excited to announce Chronosphere Histograms – an innovative histogram type designed to simplify implementation, improve query efficiency, and deliver more accurate results. Chronosphere Histograms provide an implementation of multiple histogram types, including OpenTelemetry exponential histograms and Prometheus native histograms. Plus, Chronosphere Histograms work with our Control Plane – enabling customers to manage, transform, aggregate and reduce the costs of histogram metrics at scale as efficiently as any other metric.
Histograms are an important tool for analyzing the performance of modern software systems. Ask developers, and they will tell you. Ask those same developers about the shortcomings of histograms, and you’ll get an earful: explicit bucket configuration, large margins of error, and the list of pitfalls goes on. And the more complex systems get, the more both the value of histograms and the difficulty of using them increase.
Histograms have been used in observability for a long time. Legacy histograms were added to Prometheus in 2015. However, their initial design presents problems at the scale and complexity of the environments we see today.
The open-source community recognized the increasing challenges modern architectures create for histograms. Prometheus and OpenTelemetry agreed upon a new compatible histogram type that solves the problems inherent in the design of legacy Prometheus histograms.
We’ll delve into the specific challenges shortly. But first, let’s quickly review what a histogram is.
Histograms are a general mathematical tool for graphically showing the frequency distribution of data values. They are displayed like bar charts, with each bar representing a range of values called a “bucket.” The height of each bar is the count of measurements that fall into that bucket.
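To make the idea of buckets concrete, here is a minimal sketch of how measurements are counted into buckets. The boundary values and the sample latencies are made up for illustration, and the convention that each bucket includes its upper bound is an assumption of this sketch.

```python
from bisect import bisect_left

def bucket_counts(values, upper_bounds):
    """Count how many values fall into each bucket.

    Bucket i covers (upper_bounds[i-1], upper_bounds[i]]; one extra overflow
    bucket at the end collects everything above the last bound.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        # bisect_left finds the first bound that is >= v
        counts[bisect_left(upper_bounds, v)] += 1
    return counts

latencies_ms = [12, 35, 38, 40, 41, 95, 120, 480, 510]  # illustrative samples
bounds_ms = [50, 100, 250, 500]                          # illustrative boundaries
counts = bucket_counts(latencies_ms, bounds_ms)

# Print a tiny text "bar chart", one bar per bucket
labels = [f"<= {b} ms" for b in bounds_ms] + [f"> {bounds_ms[-1]} ms"]
for label, count in zip(labels, counts):
    print(f"{label:>10} | {'#' * count}")
```

Each bar is just a counter, which is why histograms are cheap to record and easy to add together later.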
An everyday use case for histograms in observability is measuring latency. Because latency can vary widely, using just an average or median (p50) often doesn’t accurately reflect what many users experience. To get a complete understanding, you need to examine the “long tails” of latency. Histograms allow you to do that and additionally compute different statistics across arbitrary dimensions and time windows, in a way that pre-computed percentiles do not.
Let’s look at a very simplified but clear example of how histograms are helpful:
If a service handles 100 requests, with 95 completing in 40 milliseconds (ms) and 5 taking 500 ms, the average response time is 63 ms, and the median (p50) is 40 ms. Neither number reveals that 5% of users experience much slower response times. Using a histogram to summarize the latency of all 100 requests lets us look at the tail of the distribution, at and beyond the 95th percentile (p95), and see that 5% of requests experience 500 ms response times.
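The arithmetic in this example can be checked with a few lines of Python. This sketch assumes, for simplicity, that each of the 95 fast requests takes exactly 40 ms, which is the case that produces the 63 ms average; the 100 ms threshold for "slow" is an arbitrary illustrative choice.

```python
import statistics

# 95 fast requests at 40 ms and 5 slow requests at 500 ms
latencies_ms = [40] * 95 + [500] * 5

mean_ms = statistics.mean(latencies_ms)      # (95 * 40 + 5 * 500) / 100
median_ms = statistics.median(latencies_ms)  # middle of the sorted values

# Share of requests slower than an arbitrary 100 ms threshold
slow_share = sum(1 for v in latencies_ms if v > 100) / len(latencies_ms)

print(f"mean={mean_ms} ms, median={median_ms} ms, slow share={slow_share:.0%}")
```

The mean is pulled up by the slow requests but matches nobody's actual experience, and the median ignores the slow requests entirely. Only the distribution shows both groups.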
If you are a large eCommerce company or financial institution, delays affecting 5% of customers can mean massive revenue losses and customer dissatisfaction.
Histograms help tackle these issues. They visually show how often data values occur within different ranges. This makes it easy to spot unusual patterns and outliers. As a result, developers can diagnose and resolve issues more quickly.
Precomputed percentiles have historically tried to offer meaningful statistics like this, but they are inflexible and awkward at query time. They usually provide only per-host statistics, which become increasingly inaccurate when calculating the long tail of a busy endpoint or operation across many instances.
The challenges with legacy Prometheus histograms stem from two major design constraints. First, the legacy Prometheus histogram isn’t a first-class metric type and instead is a compound metric. It’s scraped and persisted as multiple counter time series, which is both inefficient and constraining.
Second, it only supports an explicit, fixed bucket layout that must be specified up front for every histogram. Once defined, the bucket layout doesn’t change, regardless of the values being recorded. This rigid structure isn’t practical in modern, complex environments where the distribution of data changes over time.
These two issues combine to cause the following major problems:
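A small sketch shows why per-host percentiles cannot simply be combined, while histogram buckets can. The two hosts, their traffic, and the bucket boundaries are invented for illustration, and the percentile uses a simple nearest-rank definition.

```python
import math
from bisect import bisect_left

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def bucket_counts(values, upper_bounds):
    """Per-bucket counts with inclusive upper bounds plus one overflow bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts

host_a = [20] * 990 + [900] * 10  # busy host: 1% of requests are slow
host_b = [20] * 80 + [900] * 20   # quiet host: 20% of requests are slow

# Averaging per-host p95 values gives a number that describes neither host
avg_of_p95 = (percentile(host_a, 95) + percentile(host_b, 95)) / 2
true_p95 = percentile(host_a + host_b, 95)

# Histogram buckets, by contrast, merge exactly by adding counts
bounds_ms = [50, 100, 500, 1000]
merged = [a + b for a, b in zip(bucket_counts(host_a, bounds_ms),
                                bucket_counts(host_b, bounds_ms))]

print(avg_of_p95, true_p95, merged)
```

Because the merged counts equal the counts computed over all requests at once, any statistic derived from the histogram can be computed across hosts, labels, or time windows after the fact.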
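To illustrate the compound nature of the legacy format, the sketch below expands a handful of observations into the separate series a classic Prometheus histogram exposes: one cumulative `_bucket` counter per `le` boundary (including `+Inf`), plus `_sum` and `_count`. The metric name, observations, and boundaries are illustrative, and the function is a simplified model rather than Prometheus client code.

```python
def classic_histogram_series(name, observations, upper_bounds):
    """Model the series exposed by a classic (legacy) Prometheus histogram."""
    series = {}
    for bound in upper_bounds:
        # Buckets are cumulative: each counts every observation <= its bound
        series[f'{name}_bucket{{le="{bound}"}}'] = sum(1 for v in observations if v <= bound)
    series[f'{name}_bucket{{le="+Inf"}}'] = len(observations)
    series[f"{name}_sum"] = sum(observations)
    series[f"{name}_count"] = len(observations)
    return series

series = classic_histogram_series(
    "http_request_duration_seconds",  # illustrative metric name
    [0.03, 0.04, 0.2, 0.7],           # illustrative observations, in seconds
    [0.05, 0.1, 0.5],                 # illustrative bucket boundaries
)
for key, value in series.items():
    print(key, value)
```

Even with only three boundaries, one histogram becomes six time series per label combination, and every additional boundary adds another.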
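The cost of a poorly matched fixed layout shows up when estimating quantiles. The sketch below estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank, which is the general approach used for classic bucketed histograms. The function is a simplified illustration that assumes every observation falls at or below the last boundary; the data and boundaries are invented.

```python
def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate quantile q (0..1) by linear interpolation within a bucket."""
    total = cumulative_counts[-1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for bound, count in zip(upper_bounds, cumulative_counts):
        if count >= rank and count > lower_count:
            # Assume observations are spread evenly across the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (bound - lower_bound) * fraction
        lower_bound, lower_count = bound, count
    return upper_bounds[-1]

# 100 requests that all took 300 ms, so the true median is 300 ms
coarse = estimate_quantile(0.5, [100, 1000], [0, 100])
fine = estimate_quantile(0.5, [100, 250, 500, 1000], [0, 0, 100, 100])

print(f"coarse layout: {coarse} ms, finer layout: {fine} ms, true median: 300 ms")
```

With the coarse layout the estimate lands far from the true value, and the only remedy in a fixed layout is to redefine the buckets by hand once the distribution is already known.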
Here’s how the new single-value Chronosphere Histogram and new exponential growth bucket layout available in OpenTelemetry SDKs and Prometheus clients combine to solve the legacy Prometheus histogram problems:
Chronosphere customers can send OpenTelemetry exponential histograms (delta or cumulative) via the OpenTelemetry Collector to Chronosphere or they can use the Chronosphere Collector to scrape Prometheus native histograms. All Chronosphere Observability Platform features work with both histogram formats and the exponential growth bucket layout.
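For readers curious about the exponential growth bucket layout, here is a minimal sketch of the base-2 scheme described in the OpenTelemetry data model, where a `scale` parameter sets the base to `2 ** (2 ** -scale)` and bucket `i` covers `(base**i, base**(i+1)]`. The function names are hypothetical, the logarithm-based mapping can be off by one for values extremely close to a boundary, and this is not Chronosphere's implementation.

```python
import math

def exponential_bucket_index(value, scale):
    """Map a positive value to its bucket index for the given scale."""
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    return math.ceil(math.log2(value) * 2 ** scale) - 1

def bucket_bounds(index, scale):
    """Return the (exclusive lower, inclusive upper) bounds of a bucket."""
    base = 2 ** (2 ** -scale)
    return base ** index, base ** (index + 1)

# At scale 0 the base is 2, so boundaries are powers of two
assert exponential_bucket_index(4.0, 0) == 1  # bucket (2, 4]
assert exponential_bucket_index(5.0, 0) == 2  # bucket (4, 8]

# At scale 3 the base is 2 ** (1/8), so each bucket is about 9% wide
idx = exponential_bucket_index(100.0, 3)
low, high = bucket_bounds(idx, 3)
print(f"100.0 -> bucket {idx}: ({low:.2f}, {high:.2f}]")
```

No boundaries are configured up front: the scale alone determines the layout, and relative bucket width stays constant whether values are in microseconds or minutes.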
Chronosphere’s Control Plane helps customers reduce the costs of histogram metrics, as it does for any other metric. Customers can define aggregation rules to reduce cardinality and drop rules to drop an entire histogram time series. A new HISTOGRAM aggregation function gives customers a way to aggregate measurements or gauges. For example, when aggregating away the instance label on a container metric like container_memory_usage_bytes, the HISTOGRAM aggregation function records the distribution of all gauge values in the time window.
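Conceptually, aggregating gauges into a histogram looks something like the sketch below. This is only an illustration of the idea, not Chronosphere's implementation or rule syntax: the function name, the sample data, and the use of fixed byte boundaries are all assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

def aggregate_to_histogram(samples, drop_label, upper_bounds):
    """Group gauge samples by their remaining labels and bucket the values."""
    grouped = defaultdict(lambda: [0] * (len(upper_bounds) + 1))
    for labels, value in samples:
        # The group key is every label except the one being aggregated away
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        grouped[key][bisect_left(upper_bounds, value)] += 1
    return dict(grouped)

MIB = 1024 * 1024
# Hypothetical container_memory_usage_bytes samples from one time window
samples = [
    ({"container": "api", "instance": "node-1"}, 120 * MIB),
    ({"container": "api", "instance": "node-2"}, 310 * MIB),
    ({"container": "api", "instance": "node-3"}, 95 * MIB),
    ({"container": "db", "instance": "node-1"}, 900 * MIB),
]
bounds = [128 * MIB, 256 * MIB, 512 * MIB]  # illustrative boundaries

result = aggregate_to_histogram(samples, "instance", bounds)
for key, counts in result.items():
    print(dict(key), counts)
```

Four per-instance gauge series collapse into two histograms, one per container, while the spread of values across instances is preserved instead of being flattened into a single sum or average.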
We’re excited about the new problems customers will solve with the Chronosphere Histogram metric type and Control Plane features!
Request a demo for an in-depth walkthrough of the platform!