In a world full of data, it can be difficult to diagnose errors, especially during a crisis. Martin Mao, CEO and co-founder of Chronosphere, sits down with Ben Popper and Ryan Donovan for the 488th episode of The Overflow Podcast. Together, the three chat about why diagnosing errors in production has become increasingly tedious, and how more data causes more problems.
Chronosphere is here to help companies not only gain key insight into their problems, but also automate responses, cut down on troubleshooting time, and successfully scale.
If you’re curious to learn more about the start of Chronosphere and our differentiating factors, take a look at a detailed breakdown of this episode below.
Introducing observability and Chronosphere
Ben Popper: Today is a sponsored episode from the folks at Chronosphere, and we’re going to be talking about application monitoring in a cloud native world when you’re getting up to some serious speed and scale. We’ve talked about some topics like this in the past, like using the word “observability” as opposed to “monitoring.” Increasingly, I’m seeing interesting essays, and even folks internally at Stack Overflow, talking about the pros and cons of microservices versus monoliths, and that in some cases, people may have gotten so enraptured with all the microservices and APIs that they can get lost. And when something goes wrong, they’re not really sure where upstream and downstream is to fix it.
My sense is that Chronosphere tries to address some of those issues. Did I get that right?
Ryan: Yes. And I think that with all the data moving around in the cloud, it’s so easy to divorce your infrastructure from your application that monitoring everything becomes a little more difficult.
Martin Mao’s path into software and technology
Ben: How did you first get involved in software and technology? What was the first bit of code you wrote? And give us the 10,000-foot flyover of what it is you do today at Chronosphere.
Martin: I first got involved with programming using VB and VB.NET. Back in the day, a good friend of mine in high school was entering a programming competition, and he needed to do it with a team of three. I happened to be in his math class, and he needed two other folks who at least knew some math. He taught me how to write some basic VB code, and we entered the competition and actually did quite well. I ended up winning a scholarship to study software engineering at university.
If you can’t tell from my accent, I’m originally from Australia. I ended up here in the US working at Microsoft out in Seattle and Redmond. Then, I worked at other companies along the way, including Amazon and Uber.
Martin: Three years ago, I decided to found Chronosphere with an ex-colleague of mine from Uber. Today, I am the co-founder and CEO of the company, so they don’t let me write code as much these days. We provide a hosted observability platform for companies looking to adopt cloud native.
Catch up on the rest of this at 03:23.
The idea behind Chronosphere
Ryan: So, can you tell us a little bit about the idea behind Chronosphere and why you founded it? What sort of pain points does it address for programmers?
Martin: The story actually dates back to our days at Uber. Back in 2014-2015, they [Uber] decided that they wanted a microservices architecture running on container-based infrastructure. In helping them through that transition, especially from the monitoring perspective, we realized that the demands of a cloud native architecture are fundamentally different from those of the old monolithic, VM-based architecture. This required a brand new approach and a brand new type of solution. We ended up building that internally at Uber and actually open-sourced huge portions of that solution as well.
Martin: Fast forward to 2018, when all three cloud providers declared Kubernetes the winner: the standardized container platform that all three of them would support. That was our realization that the rest of the industry was going to adopt this type of architecture. We just so happened to have a lot of the best technology to solve the monitoring and observability use case, and thought that we should do something about that. That was the genesis of Chronosphere.
The demands of a cloud native environment
Martin: The demands that a cloud native environment places on us fall into four areas:
- There are many more, smaller pieces moving around, and they’re more ephemeral. That generally produces a lot more data than our old environments did, meaning you need a solution that can scale a lot higher and handle a lot more data.
- That data is more complex and has more dimensions on it. The terminology you may have heard here is higher cardinality. You can imagine that we want to slice and dice data more than ever before, because of how complex these environments are. For that requirement, you generally need something that can handle the performance and read more data than you’ve had to in the past.
- More data, more complex data, and more dimensions generally lead to higher costs as well. Imagine the cost of observing and monitoring everything: it goes through the roof. You need a more cost-efficient solution as well.
- The fourth and last is that, because the architecture is so different and distributed, you need specialized features to deal with it. This is where things like distributed tracing come in: the old APM approach of introspecting into a single binary is less useful than distributed tracing, where you’re looking at a request’s flow across all of your microservices.
Martin: Those are the four demands that a cloud native environment places on monitoring and observability, and are the four things that we’re really optimized to go and solve for companies out there.
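To make the cardinality point above concrete, here is a minimal Go sketch: every label you add to a metric multiplies the number of distinct time series a monitoring system has to store. The label names and counts below are invented for illustration, not real Chronosphere figures.

```go
package main

import "fmt"

// seriesCount returns how many distinct time series one metric can
// produce, given the number of possible values for each of its labels.
// Cardinality is multiplicative: each new label dimension multiplies
// the total.
func seriesCount(labelValueCounts []int) int {
	total := 1
	for _, n := range labelValueCounts {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical labels on a single request-latency metric:
	// 50 services x 20 endpoints x 300 pods x 3 versions x 4 regions.
	fmt.Println(seriesCount([]int{50, 20, 300, 3, 4})) // 3600000
}
```

One metric with five modest label dimensions already yields millions of series, which is why cloud native environments demand storage and query layers built for high cardinality.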
This is Chronosphere’s tech stack
Ben: Tell us a little bit about your tech stack. You were able to found this company – what kind of languages and frameworks did you choose? And why did you choose them, keeping in mind the four pain points you just walked us through?
Martin: When we were looking at solutions before we decided to build our own, it was actually pretty obvious that nothing in the open source world, and not much in the vendor world, could solve these problems. We ended up having to build everything from scratch: from the storage layer, to the ingestion tier, to the query layer.
We decided to do it in Go, which seems to be the cloud native language of choice. I would say there are pros and cons to Go, but one of the benefits is that there are not a million different ways to do something. Onboarding new developers is fairly easy, and the code is simple and standard for everyone to read.
Martin: The whole tech stack was custom built in Go from the ground up. We’re not leveraging any other storage technology. We’re not leveraging things like Kafka. That’s really what you need when you’re tackling these problems at a fairly large scale. You need pretty specific technology. The whole thing runs on Kubernetes as well. We are big believers in cloud native ourselves. So, as a SaaS solution, all of this technology runs on a containerized environment. You can imagine we’ve built a lot of services there. We, ourselves, adopt a cloud native architecture as well.
What Chronosphere solves
Ben: We talked about these problems, and it all gets a little ephemeral, like the cloud itself. It’s hard to understand in concrete terms, right? Can you talk about some actual things going wrong that this [cloud native observability] solves?
Martin: Observability is the new buzzword now, but we’ve always had to have insight and visibility into our software, ever since we started writing software. If you look at the purpose of these monitoring or observability tools, it has not changed very much over the years. You really want to use them to do three things:
- Detect issues when they occur
- Be able to triage and troubleshoot them, and understand what the impact is
- Be able to root cause and fix the particular issue to remediate that problem
Chronosphere’s differentiating factors – faster detection, triaging, and understanding data
Martin: From our perspective, that has always been true. Now, it’s so much harder to do those things because the architecture and the environment we work in is so much more complex. To give you some concrete examples in all three areas: when we are detecting issues, all of these systems historically have emitted data like “what is the error rate? What is the latency of a particular endpoint?”
You can imagine that if you’re rolling out a piece of software and you detect that all of a sudden you have a lot of errors, the first thing you may do is roll back that piece of software straight away and remediate the issue. The role that Chronosphere plays is to collect that data, analyze it, and know that errors are being returned now. Perhaps the differentiator for Chronosphere is that the latency at which the data becomes available in our systems is in the tens of milliseconds, so we can detect issues much faster than a lot of other systems out there. We can actually trigger the alert within a second as well. You can imagine the main difference: you can detect these issues within a second as opposed to a minute or ten minutes. The impact is heavily reduced there.
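As a rough illustration of the detection step described above, here is a minimal Go sketch of a rolling error-rate check. The window mechanics and the 5% threshold are assumptions for the example, not Chronosphere’s actual alerting logic.

```go
package main

import "fmt"

// window accumulates request outcomes so an alerting pipeline can
// compare the current error rate against a threshold as data arrives.
type window struct {
	total, errors int
}

// observe records one request, flagging whether it failed.
func (w *window) observe(isError bool) {
	w.total++
	if isError {
		w.errors++
	}
}

// errorRate returns the fraction of observed requests that failed.
func (w *window) errorRate() float64 {
	if w.total == 0 {
		return 0
	}
	return float64(w.errors) / float64(w.total)
}

// shouldAlert fires when the error rate exceeds the given threshold.
func (w *window) shouldAlert(threshold float64) bool {
	return w.errorRate() > threshold
}

func main() {
	w := &window{}
	for i := 0; i < 100; i++ {
		w.observe(i%10 == 0) // simulate 10% of requests failing
	}
	fmt.Println(w.shouldAlert(0.05)) // true: 10% exceeds the 5% threshold
}
```

The speed advantage Martin describes comes from running a check like this over data that is queryable within tens of milliseconds of ingestion, rather than minutes later.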
Martin: The second piece is what we call triaging or knowing what’s going on. A concrete example here would be if you get alerted when something goes wrong; maybe your error rates or latency is spiking. You want to know what the impact is. Perhaps you get paged at 2 AM and you wonder if you have to deal with it now, or go back to sleep and just snooze until the morning when you can be in a better frame of mind. Generally, you want to look at what is happening in the system, what the impact is and perhaps how bad the latency is. A tool like Chronosphere can collect this data and make it available in things like dashboards.
Slicing and dicing data while scaling up
Martin: Because we are really built for high cardinality, we allow you to slice and dice the data a lot more than you would normally be able to: the latency from one particular cluster vs. another, one version vs. another, et cetera. Each time you ask more questions, you need more data to answer them. Chronosphere lets you store all of that data and scale up, while also keeping these queries fairly performant.
The last piece is root causing the issue: actually finding out what caused something and putting in a fix. With Chronosphere, we are not only looking at the top-level metric data on what the impact is; we can also capture a lot of the distributed tracing data. It shows you your customer’s request flow and how that flow moves through your particular system. You can see the particular microservice or part of the stack that is causing the issues.
Martin: And the thing that Chronosphere does uniquely is that it actually does the analysis for you, as opposed to just providing you the data. You can imagine there is a ton of data, and it’s fairly complicated. You can definitely sift through it in a lot of tools out there, but having the tool just tell you, “Hey, this is where the root cause is,” is what differentiates us from others in the market.
Fires at the fire department
Ben: So, I guess the system is looking at all of these different flows of data, trying to respond quickly and isolate things within a larger ecosystem. What happens if your observability system goes down? If the fire is at the fire department, how do you fight it?
Martin: That’s a great question. It’s the worst nightmare of an engineer on-call: “something is wrong and I can’t see what it is.” And it actually happens more often than we think, because a lot of companies out there build, run, and manage their own solutions. Those solutions run on the same infrastructure they’re in charge of looking after, right? So, if the infrastructure is broken, the very system that’s supposed to tell you the infrastructure is broken goes down at the same time.
Some companies out there think they will solve this by purchasing a vendor SaaS solution, because it’s external. But what we’re finding more and more is that there are not that many cloud providers in the world, and we all generally use the same ones.
Martin: What we often find is that when AWS’s us-east-1 region in particular has an issue and half the internet goes down, a lot of the vendored solutions out there are also impacted. And if your production is in that region and your vendor is in that region, you are also going to be down at the same time. So, one pro tip: even if you are using a vendored solution, go and ask where they are actually running their backend, and ensure that there’s not a common public cloud dependency there, because that is often a big cause of the issue. But for us, the way we think about this is in one of two ways.
Observability recommendations – watching the watcher
Martin: The first is we need something to watch the watcher, right? Something has to tell us that the observability system is up. And for us as a SaaS solution, you can imagine we run a separate version of our own software to monitor our software.
And you can imagine that separate version is completely isolated, on a completely different set of infrastructure than our main production environments. Perhaps that is one tip for folks who are running these systems themselves: you may need a second copy to watch the watcher. The other pro tip or recommendation is that for a lot of these things, just doing a ping check, a “yes, the system is up,” is perhaps not enough, right?
Martin: The way we think about it here at Chronosphere is not just “is your system up and is it responding with a 200 check?” The way we think about it is, “is the system actually doing what it’s meant to do?” Not just the fact that it’s up. So, not only do we have another system looking upon our main system, we also do things like generate random data, write it into the system, and right afterwards read that same random data back, ensuring that the whole round-trip flow of producing the data and reading the data, your actual use case, is functioning.
That’s actually the type of check we use to generate our SLAs for our customers. So, it’s a lot more than just a “200, yes it’s up” check. It’s testing the whole system end-to-end, and then using that successful test to give ourselves an SLA and measure it that way. We definitely don’t ask the customer to ring us up when they see a failure and start the clock then. Those are the worst.
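The write-then-read probe described above can be sketched in Go. The in-memory store below stands in for the real observability backend, and all names are illustrative; the point is that the probe only counts as healthy when the full round trip succeeds, not merely when the service answers.

```go
package main

import (
	"fmt"
	"math/rand"
)

// store is a toy stand-in for an observability backend that accepts
// writes and serves reads.
type store struct {
	data map[string]float64
}

func (s *store) write(key string, v float64) { s.data[key] = v }

func (s *store) read(key string) (float64, bool) {
	v, ok := s.data[key]
	return v, ok
}

// roundTripCheck writes a random value under a fresh probe key, reads
// it straight back, and reports healthy only if the value survives the
// full write-then-read path end-to-end.
func roundTripCheck(s *store) bool {
	key := fmt.Sprintf("probe-%d", rand.Int())
	want := rand.Float64()
	s.write(key, want)
	got, ok := s.read(key)
	return ok && got == want
}

func main() {
	s := &store{data: map[string]float64{}}
	fmt.Println(roundTripCheck(s)) // true when the round trip succeeds
}
```

A plain HTTP 200 health check would pass even if ingestion were silently dropping data; this style of probe catches that class of failure.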
Ben: As you were talking, it reminded me of an episode we did recently where there was a major internet outage in Canada. Someone at Rogers pushed something into production that they shouldn’t have, and nobody could fix it because their systems didn’t work. In fact, they couldn’t even make phone calls … So, only people with emergency SIM cards and other networks could triage the situation.
The observability landscape today, from Gen 2 to Gen 3
Ryan: It’s all networked. So, more companies are going cloud native, and the old application monitoring solutions were the dashboards, system tracing, and logging. What is the difference between that sort of generation one application monitoring and the Gen 2 platforms that are going to keep up with the greater speed and scale of the future?
Martin: Perhaps we’re on Gen 3 of these systems now. Gen 1 was the original APM and IT monitoring solutions that you mentioned. I think originally, they were mostly on-premise, and then we saw the Gen 2 of these things, which were mostly SaaS solutions. However, they were targeted at the environment that we all used to run, which was primarily larger monolithic applications on VMs.
Those environments produced less data at lower cardinality, and monitoring them didn’t cost as much, so the solutions were ideal for that old architecture, and that’s what they still are today. And what is happening is that as the architecture evolves, what we’re trying to do with these tools, in terms of detecting issues and solving them, hasn’t changed.
Martin: The thing that’s really changed is that the architecture fundamentally looks very different now. So, as you look at more modern, perhaps Gen 3 solutions now, you’re really looking at solutions that are optimized for these particular environments. And I really feel like you can’t have a solution that’s optimized for both. It’s either good for one or the other, right?
So, the way we look at it here at Chronosphere is that we’re really trying to optimize for cloud native environments. There’s no reason you couldn’t use an old solution for the new environment or the new solution for the old environment. But if you try to do something like that, and a lot of folks are running into this now, the tools are not as effective because, to your point, they’re built for things like DTrace and strace introspection into the OS, as opposed to showing how the request is really flowing across your broader architecture.
Targeting high cost and overruns
Martin: Also, a lot of companies are complaining about the cost, because the volume of data just keeps increasing. Everybody’s complaining about the bills, how expensive these things are getting, and starting to question their value. So, I think the Gen 3 solutions are going to specifically target the pain points that get introduced when you make the architecture switch. That’s probably the main difference there.
Ryan: Yeah. Speaking to the pendulum, there was a story the other day about companies moving their machine learning models on-prem so that they don’t have crazy cost overruns when they don’t expect them. Some of them were supermarkets deciding how and when to stock the inventory on the shelves, and they wanted latency not to be an issue. So, it’s interesting how everything moved to the cloud, and now in some cases it’s moving back, to avoid cost or for the benefit of speed.
Martin: I guess these are the larger cycles that we all live within. For some of those, I think the power, flexibility, and scalability of the cloud is great, but if you have consistent workloads that you know about, humming along all the time, perhaps an on-premise data center is just fine. For cloud native, with containerized platforms, the interesting thing is that you’re going to run into these problems regardless of whether you’re in the public cloud or on-premise. If you are adopting a container-based platform across those two environments, that’s where a lot of these challenges are going to come up. It’s the fact that you’re changing your architecture that way, as opposed to running in a public cloud versus an on-premise data center.
Building as a developer for developer burnout
Ben: One of the things I wanted to ask you about was your experience as a developer at these different places, and now as a founder. We’ve heard a lot through our developer survey and through our podcast about developer burnout and what it’s like to be working now in a largely remote environment. Do you feel that there’s something within the tools you’re proposing that might help with that? Does cloud native make it worse, and is good observability part of the answer?
Martin: My personal opinion is that cloud native definitely makes it worse for developers. Now, I think that going cloud native does promise a lot of great things in terms of reliability and scalability. There are a lot of advantages to it, which people can see, and hence the whole industry is moving in that direction. But a lot of the advantages come at a cost, and that cost, I would say, falls primarily on the developers more than anybody else. Going back to my days at Microsoft, this was back when they had dev, test, and PM, right? I would write my code, and I wouldn’t even be in charge of testing it. I definitely was not in charge of operating that code in production.
The scope of the role was so much smaller. Whereas if you look at today, you’re generally in charge of everything. I think that sort of dev-and-test mentality has largely disappeared now, so you are really responsible for writing that service, testing it, and today for operationalizing it as well, deploying it into production. Already there’s more scope and more responsibility there, but you can imagine that in a cloud native environment, you’re deploying to a very complex architecture where the infrastructure underneath you is changing all the time. You have so many downstream dependencies that you don’t really understand. So, you can imagine that as a developer, you were probably already pretty stressed to be on-call.