By Chronosphere Staff

Cloud native environments are a main component of digital transformation; they provide increased infrastructure visibility and flexibility for developer teams and engineering staff. But for adoption to be successful, leaders must realize it’s both a technology and organizational change. 

To provide insight on cloud native implementation and trends, Chronosphere CEO and co-founder Martin Mao joined Lee Atchison on his podcast, Modern Digital Business. They covered a range of topics, from the demands cloud native places on observability to how migrating to the cloud requires a business to change not only its servers, but also how it structures and trains its technical staff.

Application monitoring vs. cloud native application scaling

Lee Atchison: Scaling modern applications is important. So what makes scaling a cloud native application fundamentally harder in your mind than a legacy application?

Martin Mao: Cloud native environments and architectures are just so much more complex because there’s so many small, ephemeral services that have dependencies on each other. And then on top of that, the responsibility for operating in that environment is now on the developer.

Lee: One of the things I like to say when we talk about microservices versus monoliths is that for monoliths the complexity is in the code. And in microservices, the complexity is in the interactions. 

Martin: And what used to be a simple function call over a network now introduces a whole bunch of other dependencies and complexities that are outside of your code base. 

Lee: So we talk about this trilogy between scale, nimbleness, and availability. What demands does keeping that trilogy working place on observability in general?

Martin: People want more scalable, more dynamic and more highly available applications. But these things come at a cost. There are four specific costs that cloud native places on observability.

The first is that data is being produced [at a] much higher rate than before. So you need an observability solution that can scale to handle that increased volume of data. 

The second demand is complexity. You need to be able to slice and dice that data in many more ways than you needed to in the past. Rather than just running on that single monolith, you may want to know exactly which host or which node you're running on, which version it is, and things like that.

And all of these extra dimensions lead to the third problem: higher cardinality. That means querying and pivoting the data in many more ways than you needed to before.
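To make the cardinality point concrete, here is a minimal sketch of why a few extra label dimensions multiply the number of time series a backend must store. The label names and counts are invented for illustration; every unique combination of label values becomes its own series.

```python
# Hypothetical sketch: why extra label dimensions explode metric cardinality.
# Each unique combination of label values is stored as a separate time series.
labels = {
    "service": 20,            # number of microservices
    "endpoint": 15,           # endpoints per service
    "status_code": 5,         # 2xx, 3xx, 4xx, 5xx, other
    "pod": 50,                # ephemeral pods behind each service
    "availability_zone": 3,
}

series_count = 1
for dimension, distinct_values in labels.items():
    series_count *= distinct_values

print(series_count)  # 20 * 15 * 5 * 50 * 3 = 225000 series for ONE metric
```

A monolith with one long-lived host would collapse most of these dimensions, which is why cardinality was rarely a problem before.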

And then the last thing is that cloud native is a very different type of architecture. This is where things like distributed tracing really come in and become a core requirement of an observability system. Whereas perhaps in a legacy environment, [application performance monitoring (APM)] would have been a sufficient solution there.

Lee: I was just going to say that traditional APM is really becoming less and less important. So what’s the difference between distributed tracing for observability and a more traditional APM approach to observability?

Martin: Really it’s about capturing the flow of a request through your system. Back when we used to have these monoliths, a request went through one giant piece of code. Today, with a microservices-oriented architecture, you don’t have those giant portions of code anymore.

And because these microservices each do less and less, there’s a lot more interaction between your requests and the dependencies between all of your microservices. So there needs to be a different sort of tool set that can actually track these requests across and through all of your different microservices.
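The core mechanism behind what Martin describes can be sketched in a few lines: every hop in a request's path records a span tagged with a shared trace ID, so the whole flow can be reassembled afterwards. The service names and span structure below are hypothetical simplifications, not any particular tracing system's API.

```python
# Minimal sketch of distributed tracing: each service call records a span
# carrying the same trace ID, so the request's path can be reconstructed.
import uuid

spans = []  # in a real system these would be exported to a tracing backend

def traced(service, operation, trace_id, parent=None):
    """Record one span for one hop of the request. Returns its span ID."""
    span_id = uuid.uuid4().hex[:8]
    spans.append({"trace_id": trace_id, "span_id": span_id,
                  "parent": parent, "service": service, "op": operation})
    return span_id

def handle_checkout(trace_id):
    root = traced("checkout", "POST /checkout", trace_id)
    # downstream calls inherit the same trace ID, with the root as parent
    traced("inventory", "reserve_items", trace_id, parent=root)
    traced("payments", "charge_card", trace_id, parent=root)

trace_id = uuid.uuid4().hex
handle_checkout(trace_id)
print(len(spans))  # 3 spans, all sharing one trace ID
```

Real systems (OpenTelemetry, for example) propagate the trace context across process and network boundaries, which is exactly what a monolith's single call stack gave you for free.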

Lee: You used a phrase earlier, higher cardinality, and that’s really gotten me thinking about its impact on observability. Because you’re right: when you’re running a monolithic application and you start your code path, it doesn’t change throughout the call for the most part. But a microservice-based application, where you’re calling separate services on separate hosts, adds extra dimensions of data that never really existed before but are critical for solving some of these sorts of problems.

Martin: My co-founder [Rob Skillington] and I solved this problem many years ago as the core observability team at Uber. And what we found with high cardinality use cases is, as you rightly pointed out, you need these dimensions to your data. Without them, you lose a lot of your visibility. So what we did was build a backend infrastructure that could scale and handle these dimensions in a fairly cost-efficient manner.

The three phases of observability

Lee: Now I know you talk about three phases of observability. Know the problem, triage the problem, and understand the problem. You wanna talk a little bit more about what you mean there?

Martin: A lot of people in our industry define observability as logs, metrics, and traces. However, producing those three data types doesn’t guarantee you observability. At Chronosphere, we look at it from an outcomes based approach. And for us, it comes down to these three things: 

  • Can you know when something is wrong? 
  • Can you triage how bad it is?
  • Can you understand the underlying root cause to go and fix it? 

To give a concrete example, you’re deploying a microservice and all of a sudden, you detect that your microservice and your endpoint are returning errors. The first thing you probably wanna do is to get the system back into a healthy state. And, you know, you can imagine rolling back that deploy to achieve remediation there, right? 

The Chronosphere platform stores all of your infrastructure, application, and business metrics. And we have an alerting platform on top. The differentiator here for Chronosphere is that our ingest latency is extremely low. So within, I would say, a few hundred milliseconds, errors are being detected. And our alerting runs at a one-second interval as well. So you can imagine how quickly you can detect this, and you can also integrate it with [continuous integration/continuous deployment (CI/CD)] systems and automatically roll the deployment back.
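The detect-and-roll-back loop Martin describes can be sketched as follows. This is a hypothetical illustration of the pattern, not Chronosphere's actual implementation; the threshold, the metrics stub, and the rollback hook are all invented placeholders.

```python
# Hypothetical sketch: wire an error-rate check to an automated rollback,
# in the spirit of the CI/CD integration described above.
def error_rate(window):
    """Stub for a metrics query; returns errors / total requests."""
    return sum(r["error"] for r in window) / len(window)

def check_and_rollback(window, rollback, threshold=0.05):
    """If the post-deploy error rate exceeds the threshold, roll back."""
    if error_rate(window) > threshold:
        rollback()  # e.g. trigger the CI/CD system's rollback job
        return True
    return False

# simulated post-deploy request samples: 10 errors out of 100 requests
window = [{"error": 1}] * 10 + [{"error": 0}] * 90
rolled_back = check_and_rollback(window, rollback=lambda: None)
print(rolled_back)  # True: the deploy exceeds the 5% error threshold
```

The value of low ingest latency is that this loop can fire seconds after a bad deploy, before most users are affected.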

But sometimes just knowing that something is wrong is not enough to remediate. For example, you can get notified of an error, but since you’re not actively introducing a change to the system, maybe there’s no rollback to do. So that’s when you would have to start triaging to know what is the impact and what’s really going wrong.

And this is where that concept of high cardinality comes in. If you have increasing error rates, perhaps have a look at the view of: are the error rates happening across all of my deployments? Is it just one particular [availability zone (AZ)]? Is it just one particular cluster that this is happening in? And you may not want to do the complete root cause analysis of what is causing it. You could remediate the issue simply by routing around the particular AZ that is impacted and just have your requests go to the other AZs you’ve deployed.
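The slicing step above can be sketched as a group-by over a labeled error metric. The AZ names, traffic numbers, and 5% threshold below are all invented for illustration.

```python
# Hypothetical triage sketch: slice an error-rate metric by availability
# zone to see whether failures are global or confined to one AZ.
from collections import defaultdict

samples = [
    {"az": "us-east-1a", "requests": 1000, "errors": 4},
    {"az": "us-east-1b", "requests": 1000, "errors": 3},
    {"az": "us-east-1c", "requests": 1000, "errors": 412},  # unhealthy AZ
]

totals = defaultdict(lambda: {"requests": 0, "errors": 0})
for s in samples:
    totals[s["az"]]["requests"] += s["requests"]
    totals[s["az"]]["errors"] += s["errors"]

suspect = [az for az, t in totals.items()
           if t["errors"] / t["requests"] > 0.05]  # >5% error rate
print(suspect)  # ['us-east-1c']: route traffic around this AZ to remediate
```

Without the AZ label on the metric (the high cardinality Martin mentions), this view simply isn't available and you're left staring at an aggregate error rate.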

And the last use case is root cause analysis. Ideally, you don’t want to do root cause analysis while the issue is happening in production, but sometimes you just have to. This is where you really want a system that handles more than just metrics and alerts. You want to bring in distributed tracing to really get to the root cause of your particular issue.

So maybe you’re an e-commerce company and your checkout flow is broken. You know that your customers are not checking out successfully. However, when you look at all of your backend [service-level objectives (SLOs)], everything is green. Now, once you’ve rolled back, you know you have all the time in the world. Neither your downstream dependencies nor your customers are impacted, as opposed to the stress of trying to figure that out while your customers are complaining and the other teams are complaining.

High data cardinality and early detection

Lee: I’d love to hear how Chronosphere helps with early detection. Say your service is starting to go a little flaky but hasn’t created problems yet. You know it’s going to be a problem eventually, based on the direction the specific metrics are trending.

Martin: You usually look at historical trends over the last hour or the last day, but there are also interesting trends that happen week on week, right? Now there are data science techniques for doing things like week-on-week standard deviation analysis, to see whether your trends are expected or unexpected based on historical data.
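The week-on-week technique Martin mentions can be sketched as a z-score check: compare the current value against the mean and standard deviation of the same time slot over previous weeks, and flag anything several deviations out. The latency values below are invented for illustration.

```python
# Sketch of week-on-week anomaly detection: flag the current value if it
# sits more than 3 standard deviations from the same slot in prior weeks.
import statistics

# request latency (ms) at the same hour over the past six weeks
history = [212, 205, 198, 220, 210, 207]
current = 390

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z_score = (current - mean) / stdev

anomalous = abs(z_score) > 3  # beyond 3 standard deviations
print(anomalous)  # True
```

This catches services that are "going flaky" relative to their own seasonal pattern, well before any absolute threshold would fire.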

And I think the second part there is, as we talked about high cardinality earlier, we’re producing so much more data with so many more dimensions. And that requires a backend that can both store all of that data and be performant enough to perform all the analysis for you as well. Those are probably the two areas where Chronosphere differentiates against other platforms in this particular space.

Prometheus for cloud native

Lee: But why can’t I just use Prometheus for cloud native monitoring? It’s obviously a common question, so what do you tell people when they ask you that?

Martin: Well, the first thing I say is that Prometheus is a fairly self-contained solution. So you can imagine with a single binary, you can get basic monitoring of your cloud native environments, which I think is fantastic. But it was designed to help folks get started in an easy fashion, not necessarily for companies to horizontally scale with a distributed solution.
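The single-binary getting-started experience Martin describes boils down to a very small configuration. Here is a minimal sketch of a `prometheus.yml`; the job name and target are placeholders. One process scrapes, stores, and serves queries, which is what makes it easy to start with and also what it was not designed to horizontally scale beyond.

```yaml
# Minimal single-binary Prometheus setup (placeholder job name and port).
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "my-service"            # hypothetical application
    static_configs:
      - targets: ["localhost:8080"]   # endpoint exposing /metrics
```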

One other thing that I tell clients when they ask me that question is, there’s an incredible cognitive load required to understand what Prometheus is telling you. And so trying to get useful tidbits of information out of that can be a challenge unless you have something on top of Prometheus or alongside it. 

Cloud native and cost efficiency 

Martin: And so in terms of what you need out of an observability solution as you adopt cloud native: the scale, the high cardinality support, the cost efficiency, and then the end user features. Those are really the tenets that Chronosphere was built around.

As far as the cost efficiency is concerned, what ends up happening for many companies is that the bill for their observability goes through the roof. So all of a sudden, you not only need to solve these challenges, but the cost of solving these challenges becomes extremely high.

And this is where Chronosphere really differentiates itself in terms of helping a company understand the data that’s being produced and then help control the cost of storing that data. And on the end user interface side of things, we’re really starting to build more customized user experiences to help you do the root cause analysis. 

Lee: You know, the piece that’s common with everything you mentioned is the focus on cloud native applications. You’re providing better solutions that are more customized for cloud native applications, but also providing them in a more cost-efficient way.

Martin: As you look at the landscape, every tool claims to do everything well. And, as we both know, that’s not the case. So we’re trying to meet companies where they are and help them get to where they eventually want to be, which means it is a journey. A lot of companies are going to have a hybrid mix of some cloud native workloads and some non-cloud native workloads.

Chronosphere meeting market demands

Lee: As cloud native matures, how do you see Chronosphere expanding to meet whatever demands may be coming out of these mature cloud native ecosystems?

Martin: That’s a great question. As you mentioned, I think we’re just at the beginning of this migration. And because of this, a lot of companies out there are simply looking for more tactical solutions today. You can imagine that as you roll out a new architecture, you just need basic visibility into your infrastructure and applications. But we’ve seen that as companies mature, they start seeing observability less as a tactical must-have and more as a strategic tool that can give them a competitive advantage. So it’s about not only solving the infrastructure and application observability use cases, but really starting to uplevel observability as a whole and showing what positive impact it can have on a business.

Advice for getting started with cloud native observability

Lee: So what advice would you give to the person who is just getting into understanding what observability means and the advantages of it? What should they be looking for in all of their cloud native understanding, specifically about observability?

Martin: My advice would be to think about not just the technology change, but the skill change that has to come along with such a migration as well. It’s a very different way of operating. And again, it’s not just that your architecture and stacks are different; the organizational mindset and the skill sets that developers need are very different.