Chronosphere has provided a framework to deliver the best possible observability outcomes at scale while keeping costs under control.
On: Jun 26, 2023
Scott is a Senior Product Marketing Manager at Chronosphere. Previously, Scott was in product marketing at VMware (via the Pivotal acquisition) where he worked on the Tanzu Observability (Wavefront) team and partner go-to-market efforts for VMware’s Tanzu portfolio with AWS and Microsoft Azure. Before VMware, Scott spent three years in product marketing at Dynatrace. In his free time, you’ll find Scott at his local CrossFit “box”, doing home improvement projects, and spending time with his family.
Cloud native environments emit a massive amount of observability data — between 10x and 100x more than traditional VM-based environments. As more organizations adopt cloud native, this explosive growth creates two problems:
The result is higher costs with worse outcomes, leading to a negative return on investment (ROI).
Chronosphere has led the industry in controlling cloud native observability data. Four years ago we pioneered an approach to shaping observability data dynamically, in a streaming fashion, by aggregating, downsampling, and dropping data before it is persisted. Since then, we’ve continued to refine that approach so customers can control their cloud native observability data more effectively.
During that time, we learned that shaping data, on its own, is not enough. As we worked closely with leading cloud native organizations to manage observability data at scale, a framework for success emerged. We call it the Observability Data Optimization Cycle. It is a vendor-neutral framework, described in detail below.
Chronosphere is the only solution that provides all of the required capabilities outlined in this framework, via our Control Plane. This helps customers deliver the best possible observability outcomes at scale while keeping costs under control and maximizing the ROI of their observability solution and practice.
The results speak for themselves. Using our Control Plane, Chronosphere customers have reduced data volumes by 60%, on average — up from 48% a year ago. While this contributes to significant cost reduction, it also enables them to use observability data more strategically, improving business outcomes.
Some examples include:
Let’s take a closer look at the Observability Data Optimization Cycle to see how it can help your company.
Today, the observability team is in the unenviable position of playing traffic cop: to control costs and avoid performance issues with the observability system, they waste time hunting down the teams responsible for sudden spikes and overages. This happens because the teams creating the data have limited insight into their usage and few incentives to control its growth. The problem only gets worse as organizations scale. A growing number of small teams, each creating more data, makes it nearly impossible for the observability team to stay on top of things, so data volumes and costs spiral out of control.
Observability data costs are becoming a significant and increasingly unpredictable budget item. To address both issues, everyone needs to understand their usage and take ownership of it. This means performing chargeback/showback, just as is done for infrastructure and cloud costs today. Once you have visibility into consumption, you can make it more predictable by allocating portions of the observability system’s capacity, with guardrails (quotas), to these teams.
Decentralized decision-making about consumption is critical because the observability team cannot possibly know what is and is not essential to each engineering team. Each team knows its data best, so each team should be empowered to manage usage against its allocated capacity, which drives positive behavioral change.
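As a rough illustration of showback, the sketch below splits a monthly observability bill across teams in proportion to their active time series. The function name, team names, and figures are hypothetical, and a real system may weight usage differently (for example by ingest rate or retention); this is not Chronosphere's implementation.

```python
def showback(series_by_team, monthly_cost):
    """Split a monthly bill across teams by their share of active series.

    series_by_team: dict of team name -> active series count (hypothetical unit).
    Returns a dict of team name -> allocated cost, rounded to cents.
    """
    total = sum(series_by_team.values())
    if total == 0:
        # No usage recorded: nothing to allocate.
        return {team: 0.0 for team in series_by_team}
    return {
        team: round(monthly_cost * count / total, 2)
        for team, count in series_by_team.items()
    }


# Illustrative numbers only.
usage = {"checkout": 600_000, "search": 300_000, "platform": 100_000}
report = showback(usage, monthly_cost=50_000.0)
for team, cost in report.items():
    print(f"{team}: ${cost:,.2f}")
```

A report like this gives each team a number it can act on, which is the starting point for assigning quotas.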
Setting quotas and driving decision-making to the teams responsible for creating the data allows centralized governance with shared responsibility to control data growth and predictability. Organizations that skip this step will never be able to achieve a long-term sustainable observability data optimization practice.
Chronosphere enables the observability team to understand how much capacity each team is using, which enables chargeback or showback. The observability team can assign appropriate guardrails to each team using Quotas and Limits functionality. By doing that, not only can the observability team ensure one team can’t crowd out other teams, but they also delegate responsibility for optimization decisions to the service teams themselves.
Quotas can be reconfigured by the observability team based on changing priorities. Each team pre-defines its priorities for how data should be handled if they approach or cross Quota limits. This lets them protect their most important data. Alerts can be set to notify teams when they are getting close to their limit so they can proactively address any capacity issues.
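To make the quota idea concrete, here is a minimal sketch of how a per-team quota with pre-defined priorities and an alert threshold might be evaluated. The `Quota` class, the `evaluate` function, the 80% alert ratio, and the dataset names are all assumptions made for illustration; they do not reflect Chronosphere's actual Quotas and Limits API.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    team: str
    limit: int                # maximum active series (hypothetical unit)
    alert_ratio: float = 0.8  # notify the team at 80% of the limit (assumed)


def evaluate(quota, datasets):
    """Decide which datasets fit within the quota.

    datasets: list of (name, priority, series); a lower priority number
    means the data is more important and is kept first.
    Returns (status, kept, dropped) where status is "ok", "alert", or "over".
    """
    total = sum(series for _, _, series in datasets)
    kept, dropped, used = [], [], 0
    for name, _priority, series in sorted(datasets, key=lambda d: d[1]):
        if used + series <= quota.limit:
            kept.append(name)
            used += series
        else:
            dropped.append(name)
    if total > quota.limit:
        status = "over"
    elif total >= quota.alert_ratio * quota.limit:
        status = "alert"
    else:
        status = "ok"
    return status, kept, dropped


quota = Quota(team="checkout", limit=1000)
datasets = [
    ("slo_metrics", 1, 400),
    ("debug_histograms", 3, 500),
    ("business_kpis", 2, 300),
]
status, kept, dropped = evaluate(quota, datasets)
print(status, kept, dropped)
```

In this example the team is over its limit, so the lowest-priority dataset is the one that gets dropped while the most important data is protected.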
Before taking action on their data, engineering teams must be able to analyze the incoming data in real-time to address issues, such as a cardinality spike, currently impacting the observability system. And they also need the ability to understand both the cost and the utility of the data in order to make smart decisions. Just knowing if data is used or not isn’t enough. Ideally, they need to analyze the data and gain an understanding of whether it is useful or not based on characteristics such as:
To understand the cost, Chronosphere analyzes all the incoming data with our Traffic Analyzer to show teams information about the cardinality and volume of the various data sets in real-time and historically. Users can:
To understand the value, our Metrics Usage Analyzer allows teams to see the cost and value of their data side by side. More specifically it:
Once teams understand the cost and value of their data, they can make decisions about how best to refine it in order to ensure the cost and value are in alignment. To deliver the best possible observability outcomes, refining must be more than simply eliminating data to reduce cost. Refining should also include capabilities to boost signal to tilt the scale in value’s favor. Refining policies must include the ability to:
Because of the dynamic nature of cloud native environments, teams need the ability to do this in real-time, and with precision, ideally without touching the source code or redeploying. To keep refining manageable at scale without consuming large amounts of developer time, they need the flexibility to apply policies globally as well as to address specific use cases.
Chronosphere enables teams to shape all data by default using either a global policy or specific policies per application, team, or arbitrary filter. This makes it feasible for teams to manage data at scale instead of writing rules for each individual metric name, as other solutions require. They can aggregate away entire dimensions from all data or subsets of the data as needed without any active management required, whether by environment, type, team, or service-by-service.
Chronosphere provides multiple Shaping Policy options to reduce, transform, and amplify data. Examples include:
Application of these policies is done in real-time at ingest (streaming), which means teams have no delay in alerts and don’t need to store the raw data. Teams have the option to preview the impact of any new or updated shaping policies before they are implemented.
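The general technique of shaping at ingest can be sketched as follows: drop unwanted metrics outright, and aggregate the rest across a high-cardinality label before anything is stored. The `shape` function, the label and metric names, and the choice to sum values (which assumes each sample is a per-interval count) are illustrative assumptions, not Chronosphere's Shaping Policy implementation.

```python
from collections import defaultdict


def shape(samples, drop_labels=("pod",), drop_metrics=("debug_heap_bytes",)):
    """Shape a batch of samples before persistence.

    samples: iterable of (metric_name, labels_dict, value).
    Metrics listed in drop_metrics are discarded; for the rest, labels in
    drop_labels are removed and values are summed across the series that
    collapse together.
    """
    out = defaultdict(float)
    for name, labels, value in samples:
        if name in drop_metrics:
            continue  # drop rule: never stored
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        out[(name, kept)] += value  # aggregation rule: sum over dropped labels
    return dict(out)


samples = [
    ("http_requests", {"service": "cart", "pod": "cart-1"}, 5.0),
    ("http_requests", {"service": "cart", "pod": "cart-2"}, 7.0),
    ("http_requests", {"service": "search", "pod": "search-1"}, 3.0),
    ("debug_heap_bytes", {"service": "cart", "pod": "cart-1"}, 1024.0),
]
shaped = shape(samples)
for (name, labels), value in shaped.items():
    print(name, dict(labels), value)
```

Four incoming series become two stored series here. Per-service dashboards and alerts still see the same totals, while the per-pod detail is never persisted.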
The result is reduced cost and improved performance, and best of all the end users don’t see a difference in their dashboards, alerts, or queries.
In a large-scale cloud native environment, the rate of change is massive. Over time, refining policies can become less effective as the environment changes and as teams use the collected observability data differently. Teams need insights into how effective their refining policies are, so they can make adjustments when necessary. Without this level of transparency, teams can’t ensure they are using their assigned capacity/quota efficiently and the organization won’t be able to maximize the ROI of its observability practice.
Chronosphere’s Shaping Policies UI makes it easy to identify policies that have become inefficient. It also shows how each policy is configured and how it is performing. The Shaping Policies UI allows users to easily see detailed information about a policy, including the platform resources a policy consumes, how efficient it is, how much it contributes to storage, the change in the contribution, and when it was last modified.
To ensure fast-loading dashboards and alerts, Chronosphere’s Query Accelerator continuously scans for slow queries and automatically creates a refined alternative that delivers the same results but with much faster performance. As a result, engineers no longer have to manually optimize queries or be proficient at writing “good queries.” They can create a query that returns the data they need, and Query Accelerator will ensure that it performs optimally wherever it’s used. This reduces toil and makes troubleshooting faster by ensuring that the queries engineers rely on while debugging return quickly.
According to a recent study by ESG, 69% of companies are concerned with the rate of their observability data growth. Implementing this framework effectively will allow any company experiencing the cloud native observability data explosion to significantly reduce the amount of data it stores while making that data more useful to its engineers.
With observability data costs and growth right-sized and under control, the organization can use the budget and time savings to do more and move observability from a cost center to a source of competitive advantage, further improving the ROI of its observability practice. Examples of ways organizations can uplevel their observability practice when they are able to control and optimize their data include:
The new world of cloud native makes controlling data based on the value it delivers, not just cost, a requirement to ensure the best possible observability outcomes. The best observability outcomes extend to more than just identifying and remediating issues quickly. They include using observability data to uncover new business opportunities, generating insights that help to improve the customer experience, etc. Cloud native opens up the opportunity to use observability data in entirely new ways, not possible with previous architectures.
The Observability Data Optimization Cycle outlines what is necessary to move your observability practice from a cost center to a source of competitive advantage. And Chronosphere is the best solution and partner to help make that happen.
Request a demo for an in-depth walkthrough of the platform!