Having the right observability tools, especially ones designed for the cloud, is necessary to achieve this balance and help drive innovation.
We recently discussed these topics in an online roundtable, How to drive business transformation with cloud native observability. Our Head of Corporate Communications, Parker Trewin, sat down with industry experts from Chronosphere and Aurora to delve into the benefits of adopting cloud native observability now and what observability can add to an organization. The panelists were:
- CEO and Co-Founder at Chronosphere, Martin Mao
- Site Reliability Engineer at Aurora, Craig Sebenik
- Product Manager at Chronosphere, Julia Blase
If you don’t have time to listen to the full on-demand webinar, read through the discussion below.
A look at Aurora’s tech stack
Craig: [Here’s] a little overview of Aurora, and what we do. The quick TLDR is that we do self-driving vehicles. The main focus at the moment is trucks; we have a couple of main offices. We are based mostly in Pittsburgh and the San Francisco Bay Area with a bunch of other [offices]. We are a highly regulated industry — getting more regulated every day — both by the federal government, and states, which have a variety of different regulations around self-driving cars. I’m sure people have seen the news around other players in this space. I wouldn’t necessarily call them competitors. Cruise and Waymo are some big names.
One of the quirks, [that’s] different from my previous companies, is we’re only in North America. We don’t have to deal with things like the General Data Protection Regulation (GDPR), which changes some of the landscape for us, and makes things simpler. But again, we have to deal with all the regulations that are coming down the pipe. We plan on launching our first commercial product next year, so we’re still a little bit in the R&D phase. We are quickly moving away from that, but that also changes some of the conversations as we don’t have customers that we need to make happy every day.
We are primarily a Kubernetes shop. Most of the things that I work on are all in Kubernetes. However, we do have mobile computers, right? There are computers in the vehicles, and we run a route. If you go to the website, you see the route that we run across the state of Texas: from the Gulf of Mexico all the way over to Arizona. It’s a very long route in the middle of nowhere. These trucks are computers. If you want data from these computers, you pretty much need cell coverage in the middle of “Nowhere, Texas.” That can be a challenge, to say the least.
We are primarily Kubernetes, but we have random EC2 instances. There’s also hardware to support the vehicles in our depots (where the trucks pull in and where they leave from); there are various computers there. One of our other big challenges is the big asynchronous system that runs our simulations and a lot of the stuff that leads to the software that actually gets deployed into the vehicle. The important part there is that it scales up and down significantly, by almost an order of magnitude in a given day.
Cloud observability at Aurora
Craig: I do need to mention that cloud observability is not the only observability we have at Aurora. We have the vehicles, but also a variety of things in the garages that need some kind of observability. Those are not the things I work on. We provide a framework that can be used by those teams, but they present their own little complications. So, I’m talking mostly about cloud observability.
I break it down into four basic parts: metrics, logs, distributed tracing, and exception handling. For metrics, there’s obviously all the Kubernetes stuff. I’ve been at Aurora for over three years, and we started with a pretty typical Prometheus and Thanos setup. If you’re not familiar with Thanos, it’s essentially just high availability (HA) Prometheus. We had a pretty standard setup with things like Alertmanager that will be familiar to anyone in the Prometheus space. We also have CloudWatch data for AWS services. We run in AWS, and we need data from, say, RDS or S3 or SQS (the various AWS services need to send data somewhere). We also need things from everywhere else, and that’s where we support StatsD. For logs, we have a variety of things, from CloudWatch, to some custom solutions in S3, to a different vendor that handles our logs.
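For readers unfamiliar with StatsD, the sketch below shows roughly what "supporting StatsD" means on the sending side: a metric is a short text line of the form name:value|type, usually sent as a UDP datagram to a collector. This is a minimal illustration, not Aurora’s or Chronosphere’s implementation. The metric names, host, and port are hypothetical (8125 is only the conventional StatsD port).

```python
import socket

# Core StatsD metric types: counter, gauge, timer (milliseconds).
VALID_TYPES = {"c", "g", "ms"}


def format_statsd(name, value, metric_type):
    """Build one StatsD line, e.g. 'depot.queue_depth:42|g'."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported StatsD metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send. Host and port are assumptions for this sketch."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, chosen only for illustration.
    for line in (
        format_statsd("sim.jobs_started", 1, "c"),
        format_statsd("depot.queue_depth", 42, "g"),
        format_statsd("upload.duration", 320, "ms"),
    ):
        print(line)
        # send_statsd(line)  # uncomment when a StatsD collector is listening
```

Because the transport is fire-and-forget UDP, anything that can open a socket can emit metrics this way, which is presumably why it suits the "everything else" that sits outside Kubernetes and AWS.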
Aurora’s migration to Chronosphere
Craig: Why did we make the jump [to Chronosphere]? There are two main pieces I’ll discuss: why I did personally, and why Aurora did. Personally, I really saw the power of observability done well when I was at LinkedIn. We were on a massive scale, growing like a weed. We were a couple hundred people when I joined, and over 6,000 when I left. It was a lot of scale: a hundred million users, and 300 million when I left. I saw the power of providing data to all the different developer teams, and how much they could self-support if they had the right information to make the right decisions.
That’s where I personally got the religion of observability relatively early on. At Aurora, I mentioned the acquisition of Uber Advanced Technologies Group (ATG). The big challenge there was that we were several hundred people. We acquired another company that was twice our size, so we grew by 200% overnight. There were all kinds of challenges with tech integration, to say nothing of people integration, and with how we could make the most improvements as quickly as possible. Throwing people at the problem wasn’t going to work. And that’s when we decided to partner with Chronosphere. We really thought Chronosphere could solve a lot of our problems a lot faster than we could.
Why are businesses investing in cloud native?
Parker: Martin, you see a lot of customers. You’re up front and center, with a lot of leading tech companies, and you kind of have your ear to the pulse of what’s happening on the street. I’m wondering, from your perspective:
- Why are people investing in cloud native?
- What are the things they’re investing in?
- What are the “gotchas”, and what do people need to be looking out for?
Martin: I’d say that we’re definitely seeing, especially compared to a few years ago, a lot more companies adopting cloud native architecture. And when we say cloud native, we mean predominantly running microservices on a container-based architecture there. I’d say a lot of the reason is because of how quickly it enables you to run. If you look at most modern businesses out there, the customer expectations have risen so much in the last decade or so. You can imagine, a decade ago, the iPhone was barely there. If you had a product or service a decade ago, maybe when you had an outage, the newspaper would write about it. Whereas today, your end users will tweet at you live as your product fails. And it’s not just from that one user — all the other users see it as well.
On top of that, access to information is so easy that anybody can go to a competitive product instantaneously. I think the broader ecosystem that every company is competing in now, and the end consumer expectations, are so different from a decade ago, and that is what’s causing companies to make a change.
You need to be shipping software almost on a daily basis, perhaps multiple times a day, and innovating at that pace to keep up with the customer demand, and to innovate and out-compete with your competitors. It’s really that broader ecosystem that’s forcing people down this path of going cloud native. Let’s be honest, it’s not an easy architecture. Nobody does it because it’s an easier thing to do. It’s actually a far more complicated thing to do, but you’re doing it because there’s a business need to do it. That’s the broader trend that we’re seeing — there is definitely a lot more adoption, and it’s because that adoption is driven by demands on the business [side].
The unexpected parts of cloud native adoption
Martin: Of the two “gotchas” that we talk to a lot of companies about, the first one is thinking about this shift to cloud native as a pure technology shift. Everybody thinks: “Great, I’m going to take my monolithic application on my VM, I’m going to break it up into microservices and run on containers, and that’s it. I’m done.”
While that’s part of the problem, it’s almost the easiest part of the problem. What people forget is that this requires a pretty large organizational shift. Because, you’ll remember in the old world, I would write code, and somebody else would test it. I have no idea how it got into production. In fact, back then, it was boxed products. I have no idea how they would print it onto a CD and ship it out there.
But in modern days, the developer who’s creating the software has to test it. They have to deploy to production, and operate it in production as well. That is a pretty big shift, organizationally, for how you structure an organization in order to prepare for that. Then, [it’s about] how you think about the skill sets required. Do your developers have the skill sets to operate in production? There are all these other pieces, outside of the actual technology and architecture change, that I think are the first thing that people overlook or don’t spend enough time focusing on.
Then, perhaps the second “gotcha” is that companies assume that the tools they’ve used before will work in the new world as well. They assume that: “If I already have a security tool, if I already have an observability or an application performance monitoring (APM) tool, and it works for me on my VMs, on my monolithic application, it’s going to work for me in this new world. They’ve served me well before; they’re gonna serve me well again.”
That assumption leads you down a pretty bad path. You can imagine there’s a reason why there’s such a large industry around observability, security, and CI/CD right now: fundamentally, the architecture and the way you’re operating and creating software have changed. Hence, all the tools you need to observe it and all the tools you need to secure it have to be different as well. I think everybody assumes: “I’ll just port everything else across.” That just doesn’t work out.
The promises of cloud native
Julia: It was really interesting to work in observability at Palantir when I did. I think Palantir started with everything on-premises (in fact, air-gapped on-premises) and realized that they had to make this shift to address performance, security, and cost management concerns.
The other promise of this cloud native microservices architecture is you scale up and down, you only pay for what you need, and you think: “Hey, this is going to be easy and great and cost saving.” There’s so much entropy though, as you start that migration. I think the entropy accelerates with cloud native, right? You have one system for on-premises, and you start with a little bit of a migration of one part of your system to cloud native.
You say: “Do I really need a big cloud native vendor? I can just put something in Elasticsearch for now.” Then you move over to something else, and all of a sudden you have eight observability systems monitoring 19 different parts of your platform with 300 teams, and you are spending all of your time firefighting that observability entropy, and [think]: “I have too much data. My developers can’t find what they need. They dig through all this junk to find the one thing they need to root cause this problem. My developers needed to control their costs. I told them to cut data, [so] they cut the one piece of data that they needed, right?”
This is the other side of: “Shoot, I cut the thing that I need to solve that problem that’s biting my customers today.” This extends to the management of those systems as well, right? You have your SREs typically, or your DevOps org. They’re managing these observability tools. They were managing one, now they’re managing several. And that drives entropy and management exhaustion.
I think there’s really kind of a hidden cost there as well that people don’t always think about, which is the opportunity cost of those SREs, and how do you spend that management team’s time at your company? Do you want these smart people to be answering help desk tickets or driving you forward and picking the actual solution — or maybe the three solutions — that you need to manage all of these things at once? The migration is never as easy. The transformation is never as quick as you think it will be. It’s easy to stumble forward, instead of building in time, to reevaluate your options and make sort of a concrete decision.
Observability is crucial for modern businesses
Martin: We talked to a lot of companies about this cloud native journey, and we advised a lot of them to think about observability early on, just like thinking about security early on. Honestly, the biggest reason is that the cost of not thinking about [observability] early on is pretty clear.
There are studies out there, conducted by third parties, that show that if you don’t put good thought into observability upfront, your engineers or your developers are spending 25% of their time debugging issues. That’s a quarter of the week just trying to figure out what is going on. There’s an ongoing cost there.
If you don’t think about these things ahead of time, especially on the hard cost side (“What is the cost of these tools and the data?”), that’s often going to blow up in your face in the form of an unexpected bill from your observability vendor as you do this transition. I’d say there are some pretty bad costs to not getting ahead of these things. It’s also one of the few things that is much harder to undo.
If you are in this new [cloud native] environment and operating, and you have the wrong tools or cost profile, and you need to go do a switch and then yet another migration, it’s really hectic to do that. It’s much easier to take a little bit of time to design your observability strategy. It’s not even about the tooling, it’s more about the strategy. Design that well up front, and put some thought into it in the same way that you would in other places. You should think about security, you should think about CI/CD. These are things that every developer has to deal with in this new environment, and if a central team puts some thought into them, it can save every developer a ton of time later on.
The cost of not thinking about it is big. Everyone I’ve talked to who has successfully been through a cloud native migration agrees that if they could do it again, they would rethink and pre-plan. It is one of the biggest learnings that they’ve taken away from going through a migration.
If you don’t, every developer has to figure it out themselves. They’re going to make mistakes themselves. There are a few things that will impact everybody in an engineering organization. In those few things, I’d recommend putting some upfront thought and design into it.
Continue this portion of the chat at minute 27:42.
Manage costs and customer experience with observability
Craig: An issue right now for us that I think is pretty broad, given the economic climate of the world, is cost. Managing costs is a big one. That means managing costs with vendors. But for smaller companies, it could be managing the costs of charging the right amount for a given product, or: “Is this given product that we offer customers really providing any value?” When you have all this data and you bubble it all up, is this actually providing something that we are cost positive on? All of that eventually comes from metric data: “How much memory does this thing consume, how much CPU, and how much disk space?” Ultimately, it comes from AWS, GCP, and Azure. And they’re not really fungible costs.
Martin: Especially as companies go towards cloud native, an interesting thing I’ve seen is that in the older architectures, it’s almost like your application endpoint maps one-to-one to your customer experience. What I’ve seen now, as architectures get distributed, is that you have an endpoint like a login, but there are 18 microservices behind it, each doing their own piece of it, so you almost need to monitor each of those microservices and each of those applications. You almost need a separate view that’s just what the customer experience of your product is, because it used to be possible to get that nearly for free. Now, you need a separate view there. What’s interesting about this separate view is that it starts to get you to ask questions like:
- How do you set SLOs for each of these individual services?
- If I break an SLO (or my service) how important is that?
- Does the break really matter if it doesn’t impact the end customer there?
It gives you another view of your business, perhaps in real time, and that view helps inform how urgent, or not urgent, a backend issue is. Perhaps that view helps you prioritize how urgent it is to get more performant.
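To make those SLO questions concrete, here is a small sketch of two standard calculations: how much of an error budget a service has left, and what a customer-facing endpoint can offer when several services sit behind it. It is only an illustration under stated assumptions. The 99.9% figures and request counts are hypothetical, and the second function assumes every service is on the critical path and fails independently, which real systems rarely satisfy exactly.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in the window (negative if overspent).

    slo_target is the success ratio promised, e.g. 0.999 for "three nines".
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def serial_availability(availabilities):
    """Availability of a request that needs every listed service to succeed.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    # Hypothetical backend service: 99.9% SLO, 1,000,000 requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"error budget remaining: {remaining:.0%}")

    # Hypothetical login endpoint with 18 services behind it, each at 99.9%.
    login = serial_availability([0.999] * 18)
    print(f"login availability if all 18 are required: {login:.2%}")
```

Under these assumptions, 18 services at 99.9% each yield roughly 98.2% at the login endpoint. That is one way to see why a separate customer-experience view is needed: every backend service can be inside its own SLO while the customer still sees more failures than any single dashboard suggests.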
Or perhaps, like I’ve seen sometimes, you get too performant. You’re like: “The customer expectation is two seconds and we can do it in ten milliseconds. Well, that’s great, but are we paying too much? Are we using too much hardware to give them a ten millisecond response, when two seconds is fine?”
It starts to bring up some pretty interesting topics there. It’s a view that one never thought about when we had old architectures. I think it’s one that modern businesses are now using to help them operate their business in a different way.
Catch up on the Q&A portion of this chat starting at minute 36:44.
Interested in how Chronosphere can help you adopt cloud native observability? Schedule a demo today.