At this year’s KubeCon and CloudNativeCon, Chronosphere unveiled our latest enhancements to the Chronosphere Observability Platform: Chronosphere SLOs, available next year, and Differential Diagnosis, or DDx.
In our recent Chronologues episode, we broke down how these two capabilities work together to help developers prioritize the most critical incidents and then resolve them more quickly with hypothesis-driven troubleshooting. That lets them get back to delivering innovation for the business sooner (or back to sleep sooner, if the incident hit in the middle of the night). Check out a transcript of the episode below and catch the full video at the end of the blog.
Introducing Chronosphere SLOs and Differential Diagnosis
Sophie Kohler: When a service goes down or starts acting up in the middle of the night, engineers often find themselves thrust into firefighting mode. They may find themselves scrambling to diagnose or resolve an incident, oftentimes on a 3 AM bridge-call. What if you could pinpoint the root cause of a service disruption and get things back on track faster?
At Chronosphere, we’re super excited to announce Chronosphere SLOs in preview and Differential Diagnosis, or DDx, in distributed tracing. These features tackle some of the biggest challenges developers face today when trying to keep complex systems up and running. Let’s dive into what these new and upcoming features bring to the table, and how they can help developers solve issues faster, stay focused, and most importantly, get back to sleep. Let’s start with how organizations can use service modeling across telemetry types without proprietary agents. Hey Nate, tell us what you think.
Automated SLOs and a high-quality customer experience
Nate Heinrich: This is something I’ve been thinking a lot about lately. More and more companies are adopting OpenTelemetry and open standards, and they want to own the telemetry data that comes out of their systems, along with its format. But I think they’re making a trade-off that they hadn’t quite anticipated: giving up observability capabilities in exchange for avoiding vendor lock-in. So, this is what my team and I have been working on a lot lately: this idea of automatic service discovery that is compatible with OpenTelemetry. It gives you out-of-the-box views that are both standardized and customizable within your business.
What do Chronosphere SLOs look like?
Nate Heinrich: We’ve added some extra improvements to help folks with problems that we’ve seen them encounter over the past few years. The first is really just getting started: there’s a lot to learn. We are integrating SLOs into our service discovery capabilities so that they become more accessible than ever.
Secondly, ongoing maintenance of SLOs can be burdensome. We built something internally for our own usage of SLOs that we are going to build into the product so everybody can use it. It’s a dynamic objective tracking capability. You can track on any dimension you like and get separate budgets from one configuration. As new customers are onboarded or new regions come online, their data is sent into the Chronosphere Platform and each one automatically gets its own separately tracked budget, which can be operationalized without you having to update or manage your SLO configuration. I’m excited to share more as we get closer to release in early 2025.
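To make the idea of separate budgets from one configuration concrete, here is a minimal Python sketch. It is not Chronosphere's implementation or API: the `DynamicSLO` and `Budget` classes, the `region` label, and the 90% objective are all hypothetical names and values chosen for illustration. It only shows the general pattern of keying error budgets by a dimension value so that a new value gets a budget the first time it appears.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Budget:
    """Running good/total request counts for one dimension value (hypothetical)."""
    good: int = 0
    total: int = 0

    def remaining(self, objective: float) -> float:
        """Fraction of the error budget still unspent; negative means overspent."""
        if self.total == 0:
            return 1.0
        allowed_bad = (1.0 - objective) * self.total
        actual_bad = self.total - self.good
        return 1.0 - actual_bad / allowed_bad


class DynamicSLO:
    """One objective, tracked separately per value of a chosen dimension."""

    def __init__(self, objective: float, dimension: str):
        if not 0.0 < objective < 1.0:
            raise ValueError("objective must be strictly between 0 and 1")
        self.objective = objective
        self.dimension = dimension
        # A budget is created the first time a dimension value is seen,
        # so no configuration change is needed for a new region or customer.
        self.budgets = defaultdict(Budget)

    def record(self, labels: dict, ok: bool) -> None:
        budget = self.budgets[labels[self.dimension]]
        budget.total += 1
        budget.good += int(ok)

    def report(self) -> dict:
        return {
            value: round(budget.remaining(self.objective), 3)
            for value, budget in self.budgets.items()
        }


# Illustrative data: one configuration, two regions, the second appearing later.
slo = DynamicSLO(objective=0.9, dimension="region")
for i in range(100):
    slo.record({"region": "us-east"}, ok=(i >= 5))  # 5 failures out of 100
for _ in range(50):
    slo.record({"region": "eu-west"}, ok=True)      # region seen for the first time
report = slo.report()
```

In this sketch, `eu-west` was never declared anywhere; it gets its own entry in `report` simply because data labeled with it arrived.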
The inspiration behind Differential Diagnosis
Sophie Kohler: Those are just two examples of how Chronosphere is making SLOs easier to use and easier to manage. Now, why don’t we take a dive into the age-old problem of troubleshooting. Cavemen did it with literal fires; developers are doing it with digital fires. Let’s take a peek into Differential Diagnosis, or DDx, and how it empowers teams to remediate faster. Hey Julia, want to give us the lowdown?
Julia Blase: Yeah, sure. Differential diagnosis was really drawn from the medical community. We were thinking: “What’s this workflow that people follow when they find out that something is wrong?” You have your SLOs, you have your services, you kind of figure out: “Hey, there’s a problem and it’s coming from over here.” But, then you want to understand why that problem is occurring.
Julia Blase: So, we were interviewing current customers, people who aren’t current customers, SREs in general, trying to understand their playbook. It sounded like, in general, when they try to figure out what’s going wrong, they have a collection of symptoms, right? You start to build a hypothesis about what’s causing the issue. And you really want a strong toolkit for that more hypothesis-driven testing workflow. We focused on making it really easy to use, like the power of a Google SRE in your pocket, so to speak, so that you could go through those same steps even if you were brand new or unfamiliar with the system.
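As a rough illustration of what a hypothesis-driven workflow over trace data can look like, here is a small Python sketch. This is an assumption about the general technique, not a description of how DDx works internally: the `rank_suspects` function, the span dictionary shape, and the sample attributes are all hypothetical. It compares how often each attribute value shows up in failing spans versus healthy ones and ranks the values that are most over-represented among failures, which gives a starting list of hypotheses to test.

```python
from collections import Counter


def rank_suspects(spans):
    """Rank (attribute, value) pairs by over-representation among failed spans.

    Each span is a dict with an "error" bool and an "attrs" dict (hypothetical shape).
    The score is the share of failed spans carrying the pair minus the share of
    healthy spans carrying it, so it ranges from -1.0 to 1.0.
    """
    failed = [s for s in spans if s["error"]]
    healthy = [s for s in spans if not s["error"]]
    if not failed or not healthy:
        return []  # nothing to compare against

    def shares(group):
        counts = Counter((k, v) for s in group for k, v in s["attrs"].items())
        return {pair: n / len(group) for pair, n in counts.items()}

    failed_share = shares(failed)
    healthy_share = shares(healthy)
    scored = [
        (pair, share - healthy_share.get(pair, 0.0))
        for pair, share in failed_share.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative data: every failure is on version v2, regions are evenly split.
spans = (
    [{"error": False, "attrs": {"version": "v1", "region": "us-east"}}] * 4
    + [{"error": False, "attrs": {"version": "v1", "region": "eu-west"}}] * 4
    + [{"error": True, "attrs": {"version": "v2", "region": "us-east"}}] * 2
    + [{"error": True, "attrs": {"version": "v2", "region": "eu-west"}}] * 2
)
suspects = rank_suspects(spans)
top_pair, top_score = suspects[0]
```

Here the top-ranked pair points at the deployment version rather than a region, which is the kind of hypothesis an engineer would then go and confirm or rule out.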
Remediating faster with a comprehensive dataset
Julia Blase: I think people are finding it easy to use, to go through this very common troubleshooting workflow. We designed it from the get-go to be really performant on high-scale, high-cardinality data. We think that you get the best diagnostic results when you’re looking at a really comprehensive dataset. That way you can be confident that you truly are getting to the root of the problem, that you understand where it’s coming from, and that you’re not missing anything because you left a little bit of data out.
Sophie Kohler: Whether you’re a seasoned developer dealing with complexity, or you’re working through your first major incident, these tools empower you to resolve incidents faster and reduce downtime for your customers. A full night’s sleep. What more could you ask for?
Stay tuned for the official release of Chronosphere SLOs, and in the meantime, check out our interactive demos on Chronosphere SLOs and DDx.