At Chronosphere, we think about observability a little differently than most. Our focus is on the problems customers need to solve to be successful, and we spend time with them understanding their most critical use cases. They tell us stories, describe on-call scenarios, and share wish lists, all of which have informed the product we are building. Today, that product is expanding to include another data source designed to enhance our customers’ ability to be notified of issues, triage those problems to determine severity and urgency, and understand the cause of the problem so remediation can be put in place. I’m pleased to announce the preview of the ability to ingest distributed traces at scale, integrated alongside metrics, to more rapidly triage problems and understand their root cause.
The challenges with existing distributed tracing tools
Across the industry, there is broad acknowledgement that being able to trace a transaction across a distributed microservices environment is incredibly valuable, especially for large, born-in-the-cloud businesses. Distributed trace data can do this in a way that metrics and logs can’t. And yet, distributed tracing hasn’t really lived up to its potential. Too often, the cost and complexity of a stand-alone distributed tracing application leave engineering organizations with too little value to justify the continued investment of time and money. Let’s focus on some key reasons why distributed tracing has yet to realize its potential:
- Too siloed. Distributed trace data is unique, and the typical deployment is separated from the data that engineers use every day, especially metrics and dashboards. Systems that force users to find some artificial way to carry the context of their investigation into a distributed tracing application often fail for this reason. Imagine finding something interesting or suspicious on a dashboard and having to search through your logs to find an ID to plug into your distributed tracing application. It’s too time-consuming, with no guarantee the search will help your investigation. After a few failed attempts, most engineers give up on the tool.
- Too complex. Unfortunately for engineering organizations, distributed tracing has become the purview of power users. The stand-alone distributed tracing applications can be very complex, requiring the user to have deep domain knowledge of both the system being investigated and the inner workings of the tracing application.
- Incomplete data. As an engineer, you’ve taken the time to bring the context you need to the issue. You have an understanding of the system you’re investigating and you’re a distributed tracing ninja. You get to the point where your efforts should bear fruit, only to find the data is incomplete. You can’t find what you’re looking for because it’s not there. Either a critical part of the system isn’t instrumented to produce tracing data, or your organization sends and stores only a small sample of the data (sometimes 1% or less).
- Too expensive. Distributed traces generate a lot of data. Without proper controls in place, it is difficult to manage the signal-to-noise ratio of your data. When the data grows uncontrolled, finding answers in it becomes more difficult. It also becomes more expensive.
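To make the incomplete-data point concrete, here is a back-of-the-envelope sketch in Python. It assumes uniform random sampling at a fixed rate, which is one common approach and not necessarily how any particular tool samples. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def chance_of_capturing(sample_rate: float, matching_traces: int) -> float:
    """Probability that at least one of `matching_traces` traces survives
    uniform random sampling at `sample_rate` (assumed sampling model)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if matching_traces < 0:
        raise ValueError("matching_traces must not be negative")
    # Each trace is dropped independently with probability (1 - sample_rate).
    return 1.0 - (1.0 - sample_rate) ** matching_traces


if __name__ == "__main__":
    # Hypothetical scenario: a failure touched 1, 10, or 100 requests.
    for count in (1, 10, 100):
        odds = chance_of_capturing(0.01, count)
        print(f"{count:>3} failing traces at 1% sampling: "
              f"{odds:.1%} chance of keeping at least one")
```

Under these assumptions, a failure that touched only ten requests has less than a 10% chance of appearing anywhere in the stored data at a 1% sample rate, which is why the search so often comes up empty.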
Breaking down the barriers to distributed tracing success
As we developed our ability to ingest distributed trace data, we focused on how we would address each one of these barriers to adoption. This preview is a result of the hard work our team has put into addressing these challenges. We use distributed trace data in conjunction with metrics data to provide deeper insight into alerting, triage and root cause analysis.
Let’s walk through how we are working to address each one of these challenges to help our customers harness the full potential of distributed trace data:
- Make better decisions with complete data. We have the scale to capture, analyze and store 100% of your distributed trace data. Sampling data due to cost or capacity limits is the norm throughout the industry. If you are using sampled data to make decisions, then those decisions are based on a statistical representation and not the actual data. Imagine making a query with the confidence that the response represents all of your data, not a sample of it.
- Deep contextual linking of metrics and traces. Chronosphere is a single-product company, and the new tracing capabilities are tightly linked into existing metrics functionality. This enables engineers to start from the broadest context and scope down to the narrowest, which is a more natural process flow. You can use a graph on a dashboard you are monitoring as a jumping-off point directly into the traces related to any point you select on the graph. Linking traces directly from dashboards eliminates the lost time and lost context that come with switching between siloed applications. The issue you see on a graph immediately becomes the set of traces related to that issue. All of this can be achieved with standard trace and metrics instrumentation such as OpenTelemetry. Nothing proprietary is required.
- Democratize usage. Our interface lets every engineer use the tool to get answers to basic queries, even engineers who are not distributed tracing experts. At the same time, our ability to scale means an expert can make complex queries to zero in on root cause, knowing the responses represent all of the data. We can analyze millions of traces at a time if that’s where the answer is. We also give you tools to take two sets of traces and analyze what is different between them. Our capacity to analyze very large sets of distributed traces helps users find those differences and point to the root cause.
- Scales to any environment with control. We’ve proven with companies that operate some of the world’s largest and most complex systems that we can scale to meet any demand for data ingestion and queries. At the same time, we give you the visibility and control to decide what data you want stored. Our controls enable you to filter out less valuable information from your data, such as unnecessary metadata. We can also eliminate entire traces that aren’t used in troubleshooting, such as those from health check endpoints. Our tools enable teams to make decisions about their data that align with business and operational goals.
- Maximize the value of the distributed trace data you have. We have developed a unique approach to capturing and storing data that makes trace queries more flexible and open. Our platform can search across holes in traces caused by missing spans of the transaction. Too often, there isn’t a common thread through the entire data that allows the user to find what they are looking for. We’ve developed a way to give you answers to queries when other systems can’t. Those queries might be in search of a specific trace, or they might be an analysis of a set of distributed traces. Especially in large, complex systems, the root cause is often not found in a single trace.
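As a rough illustration of two ideas from the list above, dropping low-value trace data and comparing two sets of traces, here is a minimal Python sketch. It is not Chronosphere’s implementation or API. The `Span` shape, the `GET /healthz` operation name, the `debug.` tag prefix, and every function name are hypothetical, and the comparison is a deliberately simple frequency difference.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified span; real spans carry far more detail."""
    trace_id: str
    service: str
    operation: str
    tags: dict = field(default_factory=dict)


def drop_health_checks(traces, health_ops=frozenset({"GET /healthz"})):
    """Drop whole traces whose spans are all health check operations."""
    return [t for t in traces if not all(s.operation in health_ops for s in t)]


def strip_tags(traces, prefix="debug."):
    """Return copies of the traces without tags that start with `prefix`."""
    return [
        [Span(s.trace_id, s.service, s.operation,
              {k: v for k, v in s.tags.items() if not k.startswith(prefix)})
         for s in t]
        for t in traces
    ]


def tag_share(traces):
    """Fraction of traces in which each (tag, value) pair appears."""
    if not traces:
        return {}
    counts = Counter()
    for t in traces:
        # Use a set so a pair counts once per trace, not once per span.
        counts.update({(k, v) for s in t for k, v in s.tags.items()})
    return {pair: n / len(traces) for pair, n in counts.items()}


def compare_sets(suspect, baseline):
    """Rank (tag, value) pairs by how much more common they are in
    `suspect` than in `baseline`, most over-represented first."""
    a, b = tag_share(suspect), tag_share(baseline)
    diffs = {p: a.get(p, 0.0) - b.get(p, 0.0) for p in set(a) | set(b)}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    slow = [[Span("t1", "checkout", "POST /pay", {"region": "eu", "debug.id": "x"})],
            [Span("t2", "checkout", "POST /pay", {"region": "eu"})]]
    fast = [[Span("t3", "checkout", "POST /pay", {"region": "us"})],
            [Span("t4", "lb", "GET /healthz")]]
    ranked = compare_sets(strip_tags(slow), drop_health_checks(fast))
    print(ranked[0])  # the tag/value pair most over-represented in `slow`
```

In this toy data, every slow trace carries `region=eu` while no baseline trace does, so that pair ranks first. The same kind of comparison only becomes trustworthy when both sets are complete rather than sampled.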
Today, we are previewing our ability to ingest distributed tracing data with select customers and partners. With the ability to capture, store and view every distributed trace that matters, we provide a level of fidelity previously missing in the industry. The power of metric and distributed trace data together elevates both for the engineers who use them every day, especially when it comes to root cause analysis. We deliver it all in a single platform designed for engineers who want to ask simple questions about their data, without compromising the extensibility power users need to generate complex queries.
Our customers told us what they wanted to do with their distributed trace data. We are excited to help them do just that — and more. Request a demo of our new distributed tracing capabilities today.