A guide to successfully monitoring your Kubernetes environment


Cloud native architectures affect every aspect of the DevOps lifecycle, bringing faster deployments and greater scalability. But if you try to bring the monitoring tools that worked in the VM and monolith-based world with you to Kubernetes, you’re going to face quite a few challenges.


Hear about this topic from Chronosphere’s Senior Developer Advocate Paige Cruz and Google Cloud’s Developer Relations Engineer Mofi Rahman in our webinar, “How to effectively monitor your Kubernetes environment.”

Over the course of an hour, the two chat about why monitoring Kubernetes is different from monitoring VM-based infrastructure, how to effectively monitor your Kubernetes environment, software selection tips, as well as how Google and Chronosphere partner to deliver the industry’s most reliable and cost-effective cloud native observability.

If you don’t have time to listen to the full conversation, check out a transcript of their chat below.

What observability is and why it’s key today

Mofi: I’m going to get us started with how to effectively monitor a Kubernetes environment. No talk about Kubernetes is complete without mentioning our predecessor – and to some degree, competitor – VMs, right? For the last 10 years or so, we’ve been on a continuous journey, moving our workloads from bare metal to VMs, and from VMs to containers.

I don’t think that journey is ever going to be complete, because there are always going to be workloads that still run and work well on VMs. But, for a lot of organizations, it seems that moving to containers is a step up. In some ways, it makes their lives easier, and in some ways harder … In the world of Kubernetes and containers, we kind of broke [applications] down into smaller domains. We have smaller teams doing smaller things, but those teams now interconnect. Our system becomes more fragmented.

On one hand, we get a lot of benefit from being able to develop and deploy our applications as small, individual binaries. But on the other hand, our applications need to talk to each other, and what happens between them is not immediately apparent. You don’t have a [person] anymore who you can just go and ask what’s happening. Now, you need to have an understanding of a complex tree of things that are talking to each other.

With that, the challenge we get a lot is with observability. I’m going to define observability again, just to level-set for everyone in the room. Observability, as a noun, means the capability of being observed. If we had a single application deployed on a VM, everything that happened in the application was visible just by looking at the logs of that VM. But all of a sudden now, in this interconnected map of small microservices, being able to see everything that’s happening becomes a whole different task. It’s no longer: “Look at the log of this one thing, and you’re done.”

Continue this portion of the chat at minute 02:11.

Why do we care about observability?

Mofi: Why do we care about observability in the first place, right? Like: “Our services were working on VMs before. Now they work on Kubernetes. I don’t really care, as long as they work.” Well, we care for a few different reasons. We want to know that our application is doing the best it can. We want to know when something goes wrong, and how to go and fix it. We want to be able to optimize our service to take advantage of this cloud environment we’re in; we don’t know, without looking at it, whether it’s doing the best it can. We want to be able to do capacity planning. We want to know: “Okay, I’m in AWS now and the holiday season is coming. I know I’m going to get a lot of international orders. I might need to plan capacity ahead of time, so that I can scale up and take advantage of cloud-scale systems.”

And for security: If I’m getting a DDoS attack, I want to know that before my services start going down. For compliance: If you’re in an industry like finance or government, you need to be able to showcase a certain level of compliance before you can even get an application deployed in your systems. Or if you’re in finance, you need to show a certain level of security compliance with observability before you can do certain things with certain entities.

Continue this portion of the chat at minute 04:43.

Mofi: The evolution of observability tracks pretty well with the evolution of cloud native. In the beginning, we had bare metal and VMs, and someone could just SSH into the machine and go see what logs were being output to the .log file. Those wizards among folks could read those logs and understand what’s going on, which is great (props to them, but I’m not one of them). Then came metrics. We thought: “Logs are great, but what if I could just send actual numeric data that tells us exactly how my machine is behaving?” That’s great. And for a single VM, metrics were probably all we needed, to be honest.


How is monitoring a Kubernetes environment different?

Mofi: Kubernetes is a dynamically scaled environment, so it can scale up as far as your resources allow. Kubernetes is controlling how things get scheduled [and] where. A lot of the assumptions we could once make, like “that’s the IP of this VM, and this VM lives here,” no longer hold. All of that happens dynamically, so you can’t really assume where everything will be. The same goes for networking and load balancing: it’s also dynamic. Kubernetes has an internal domain name system (DNS). You have external DNS connecting your application to the outside world. All of that is no longer a static thing. It constantly changes.

For pod and node lifecycle management, your application can be killed by the orchestrator for a number of reasons. You can’t just assume that something will be there. You have to do a lot of health checks to confirm things are there, and if they’re not, figure out what you can do to bring them back.
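To make that concrete, here is a minimal, hypothetical sketch of asking the Kubernetes API for the current state of your pods instead of assuming a static IP or a fixed placement. It uses the official kubernetes Python client; the namespace and the fields printed are just for illustration, not something from the webinar.

```python
# Minimal sketch (not from the webinar): ask the Kubernetes API for current
# pod state instead of assuming static IPs or placement.
# Assumes the official `kubernetes` Python client and valid credentials.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()

# Pod IPs, nodes, and phases change as the orchestrator reschedules workloads,
# so look them up on demand rather than hard-coding them.
for pod in v1.list_namespaced_pod(namespace="default").items:
    ready = all(c.ready for c in (pod.status.container_statuses or []))
    print(pod.metadata.name, pod.spec.node_name, pod.status.pod_ip,
          pod.status.phase, "ready" if ready else "not ready")
```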

Finally, cluster-wide resource management. You have this resource pool that is probably shared by multiple applications, so it’s not guaranteed for just your application: the noisy neighbor problem. That was not a thing in the VM world, because each application just got its own. All of a sudden, in the Kubernetes world, we are sharing the same resources across multiple applications. So, it gives us a huge benefit of being able to use our resources more efficiently. But now we have to think about our neighbors, right? We have to play nice with other people who are sharing the resources with us.

Defining goals with Kubernetes observability

Mofi: I’m going to give you 10 simple steps (with an asterisk) for Kubernetes observability. Once you do all of this, we’re all going to be great with observability in Kubernetes.

  1. Define goals
    We have to define things like: “What do we actually want to set our goals as? How much uptime do we want? How much downtime can we afford? What kind of resource spike can we allow, and what’s the upper limit on how much money we’re willing to spend?” All of that, we have to define from the get-go.
  2. Find a combination of tools
    [Paige and I] are both vendors here. I would obviously assume that our tools are the best, but they might not be for your use case. Find out what tools and combinations work for your use case, and then understand those tools well, because a lot of tools can do a lot of things — as long as you’re willing to put in the work.
  3. Instrument code
    Paige probably feels very strongly about this, but we want to have our code instrumented so that we can get the most out of the tools we’re using. A lot of tools work out of the box without us doing much work, but we get the most information, and can take the best course of action, by instrumenting our code. So, instrument your code (there’s a small instrumentation sketch after this list).
  4. Collect and visualize
    Your tools will collect the data, and most of the tools come with some sort of dashboard system. Create those dashboards, and have those visualizations ready so that, at a quick glance, you can understand exactly what’s happening in your system.
  5. Track resource use
    You want to understand how many resources you’re using, and which resource or application is using too much. And with the dashboard, you can start tracking: “This system used to use n cores last month, and all of a sudden this month it’s using five [nano] cores.” Good news: you have a lot of new customers. Or bad news: something has gone wrong. Either way, you want to know.
  6. Logging and log aggregation
    I feel like, as we progress further and further into this cloud native world, logging was really cool and then kind of dipped in popularity. Logging is really cool again, because you need to be able to log, aggregate, and understand the lineage of a certain thing that has happened in your system (there’s a structured-logging sketch after this list).
  7. Distributed tracing
    If you don’t instrument your code properly, distributed tracing won’t do much. Again, these are incremental steps. [First] you instrument your code. Now, all of a sudden, you have distributed tracing. You can see what a request did in your system, throughout the system. You can see: “Okay, I have this one function that’s taking a really long time. What has gone wrong?” You dig deep into it and understand exactly what went wrong or where you have issues (there’s a tracing sketch after this list as well).
  8. Alerts and notifications
    Once you have done all the previous steps, you want to be able to send alerts and notifications to yourself or your team, so that when something does go wrong, someone is out there looking out for it.
  9. Continuous re-evaluation and fine tuning
    The first eight steps are things you do, and then you just keep doing them every day until you figure out: “Okay, this is working well, this isn’t doing what I wanted it to do.” Observability is a never-ending journey. You start it, figure out that something works, and then you find all the ways that it doesn’t work and what you can do better.
  10. Best practices and updates
    As much as we’d love for our systems to be future-proof, they never are. New things come up and change, and maybe you find out that the software you’re using has new updates and you haven’t updated in six years. You have to keep up with it.
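As a rough illustration of step 3, here is what instrumenting a small Python service with the open source prometheus_client library might look like. The metric names, labels, and port are made up for this sketch, not taken from the webinar.

```python
# Hypothetical instrumentation sketch using the open source prometheus_client
# library. Metric names, labels, and the port are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total", "Total checkout requests", ["status"])
LATENCY = Histogram("checkout_request_seconds", "Checkout request latency in seconds")

@LATENCY.time()  # records how long each call takes
def handle_checkout() -> None:
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    status = "ok" if random.random() > 0.05 else "error"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_checkout()
```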
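For step 6, structured logs are much easier to aggregate and trace lineage through than free-form text. This is a small standard-library sketch that emits one JSON object per line for whatever log pipeline you run; the service name and trace ID field are assumptions for illustration.

```python
# Structured-logging sketch using only the Python standard library.
# The "service" and "trace_id" fields are assumptions for illustration;
# one JSON object per line is easy for log aggregators to parse.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "checkout",
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info("order placed", extra={"trace_id": "abc123"})
```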
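And for step 7, this is roughly what hand-instrumenting a couple of spans with the OpenTelemetry Python SDK can look like. The service and span names are hypothetical, and a real deployment would export to an OTLP endpoint or collector rather than the console.

```python
# Rough distributed-tracing sketch with the OpenTelemetry Python SDK.
# Span and service names are hypothetical; a real setup would export to an
# OTLP endpoint or collector instead of the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def lookup_inventory(item_id: str) -> bool:
    # Child spans nest under the request span, which is how you spot the one
    # function that is taking a really long time.
    with tracer.start_as_current_span("lookup-inventory") as span:
        span.set_attribute("item.id", item_id)
        return True

def handle_request(item_id: str) -> None:
    with tracer.start_as_current_span("handle-request"):
        lookup_inventory(item_id)

handle_request("sku-1234")
```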

Tools for Kubernetes observability

Mofi: We are inundated with choices [of tools]. We have so many choices. All of the tools that I’m showing you are taken from the Cloud Native Computing Foundation (CNCF) landscape. The CNCF landscape probably covers 70% of all tools that exist, but there are tools outside the world of CNCF. Because we’re talking about Kubernetes, CNCF is probably a good place for us to start. Again, if you want to break the tools down into multiple parts, you’re probably looking at something like open source: There, you have CNCF projects, member projects, and non-CNCF member projects, and then you have proprietary.

Proprietary would be the tools that both [Paige and I] work on. Kubernetes is open source, but GKE is a managed version of it. We have our own logging and monitoring suite called Operations Suite. Chronosphere is also a vendor that provides proprietary solutions for that kind of stuff. But let’s look at some of the open source tools. These are sorted by GitHub stars. The first one has the most GitHub stars at about 62,000, and we go all the way down to 400 GitHub stars. These are all projects that exist in the space.

A few months ago, I actually tried out half of them, just to experiment with how they are … Some of them have really nice UIs, some of them have nice collectors. But generally, they get the job done. If you want to invest the time and energy to figure them out, you can do really great things with most of these tools.

Continue this part of the conversation at minute 16:32.

The world of proprietary monitoring tools

Mofi: Then, you have the world of not open source. This is where both of our companies live: Google Stackdriver and Chronosphere are both products that will allow you to do things [in Kubernetes]. All the cloud vendors happen to be [in this category], like Amazon, Azure, Google, and IBM. Then you have specialized tools, again, like Chronosphere, Datadog, and New Relic. All of these tools do very specific observability-related things. Depending on what you’re looking for, that might be really powerful for your stack.

For open source logging tools, Elastic has the open source version of things, but they also have a managed solution. For tracing tools, again, Honeycomb is doing some really cool stuff. And LightStep basically has this kind of managed solution that [is interoperable] with OpenTelemetry now, but I’m not 100% sure. For the cost optimization tools, there are a bunch of them here as well. I think I have seen the most from cast.ai, but the other tools are doing cool things as well.

Choosing the right tool

Paige: When people ask how to choose the right tool, I always say that it depends. [Which is] everybody’s least favorite answer. Today, we are going to talk about the different factors you want to consider that go into making the choice that’s most beneficial for you. It depends on your use case. There isn’t one way to monitor Kubernetes or do cloud native. It depends on your people, your organization and business goals, the stack that you’ve got running, and a few other things that we will talk about.

One choice you have, or a place that a lot of businesses start at, is open source that you’re self-managing. Sometimes, we call this do-it-yourself (DIY) observability, or in-house observability. This is often a fit if cost is a major concern, and if you do have a team with the scale, bandwidth, and experience to run, operate, host, and monitor [Kubernetes]. You don’t want to make the mistake of putting your Prometheus cluster in the same availability zone as the production applications your customers are using.

There are a lot of considerations you want to think about when hosting this stuff yourself. If you would like to be on the cutting edge of innovation, you want to keep up with the latest releases, or you don’t want to have to wait for a vendor to adopt whatever the latest release is, open source is, of course, the choice. It’s also the choice when you want to be in control of your data and your observability destiny, and you don’t want to be locked into one particular solution.

Plenty of people, up to a certain scale, work just fine with DIY observability. That is the first place, or entry point, for a lot of organizations.

Choosing a cloud provider solution

Paige: The next step is choosing a cloud provider solution, like the Google Operations suite. AWS has CloudWatch; this is great if you’re in one cloud. If you go multi-cloud, you will want that full-picture experience, because that’s what your customers are getting. Your customer doesn’t choose which cloud their request gets routed to, but you are on the hook for the customer experience. If you’re on one cloud, why not use that cloud provider’s analytics and monitoring? It’ll be the quickest path for you, and you definitely get a clear price advantage there.

Oftentimes, this is bundling, or you get discounts – depending on your usage – and it is the deepest integration with the existing infrastructure and cloud that you’re running on. That makes total sense to me.

Choosing a managed service

Paige: There are a lot of use cases where you may go multi-cloud, and you want to start considering other options – like a managed service:

  • If you’ve got a hybrid infrastructure.
  • If your team doesn’t have the bandwidth to self-manage a stack.
  • If you get to the point with DIY observability, where you’re hitting scaling limits and dashboards are taking longer to load.
  • If you’re worried about the alert evaluation pipeline.
  • If your last ops team member just quit because they burned out from running Prometheus, and they went to a company that was using a managed service.

There are a lot of factors you want to think about with the health of your operations organization, and people are definitely a big factor in whether to go managed or self-hosted. Managed services also give you features that are only available for that specific proprietary provider. There are some things, like sampling or tuning, that particular vendors provide that aren’t available in open source. There is a benefit and advantage to going with managed services.

Another flavor of a managed service is the managed open source service. That’s sort of the Goldilocks, the golden path, where you get the benefits of the portability of your data, [and] the control. No vendor lock-in. You get all of the benefits of the open source tool, the community, the ecosystem, and the tutorials.

If you’re going with open source, odds are your engineers are familiar with how to query that data and how to emit that data. You don’t need to teach them [someone’s] proprietary query language. This also gives you an off-ramp. If you would like to go back to self-hosting, maybe you go managed for a while while you’re staffing up the ops team. But you always have the ability to go back and host it yourself, because it’s your data and it’s in an open source format.

Embracing open source standards

Paige: And open source is a very big part of the cloud native ecosystem, especially when you’re talking about Kubernetes. You cannot adopt Kubernetes and be anti-open source, because all of these LEGO pieces of observability fit together. It’s important to think about embracing this as a two-way street.

You will be an end-user or a consumer of this awesome open source technology, but you can also free up your engineers to commit back if they notice there’s a small bug that’s causing you problems.

You don’t have to wait for a vendor to get it through their customer success process, all the way to their engineers, to get it fixed who-knows-when in a sprint three quarters down the line. The open source-ness kind of goes both ways.


Controlling cloud costs with Chronosphere

Paige: All of this data comes at a cost, as we talked about at the top of the hour. Controlling cloud costs, controlling the observability spend: How much are you spending on metrics versus your server infrastructure? This is a question that your organization definitely will be grappling with, if you are not already.

And what are some things you can do about it? We have the Chronosphere Control Plane. We give you different levers along the observability pipeline to understand what data you have — analyzing what is useful, and what is waste: “Are people even alerting on this metric? What label is causing the biggest amount of cardinality and costing me the most, but is used the least?” We give you that transparency.

It was something I would’ve loved to have as an SRE, because doing that manually, or fighting with another vendor to get that visibility, is a lot of engineering time, and a lot of expensive engineering time. Once you understand that, you’ve got the landscape down: “Okay, here’s my data, here’s how useful it is.”

Then you can move into this higher level of observability practices, where you are delegating. You can set quotas and say: “Every team gets x amount of observability spend, or here’s your quota for the month or quarter.” We can do cost accounting and get the trend views. That is really, really powerful in this world of microservices — where you have teams running independently. You’ve got a Ruby team, Elixir team, and a Java team who don’t speak the same language and who have different goals.

That delegation is a relief for your central observability team (COT) or your SRE team. It stops them from being traffic cops, chasing people down and saying: “You made the bill so high. Un-instrument.” An SRE never wants to hear “un-instrument.” Our whole goal is to get people the most useful observability data. Delegation really applies the “you build it, you run it, you own it” responsibility to your observability data.

Chronosphere’s customer impact using the Control Plane

Paige: The last piece is sort of the Chronosphere smarts: where we are helping you continuously optimize your telemetry state. That’s our platform giving you assistance and advice to say: “Hey, we noticed that this label’s not used. This is probably a safe one to drop.” It’s not always on you, the human, to be doing that. And maybe you’re thinking: “How much could dropping some unused metrics and labels really benefit me? I’m a little skeptical.”

I brought you some numbers here. Zillow, a housing platform, achieved an 80% optimization of their data volumes by tuning the parameters of what they were emitting, and at what timescale. Abnormal Security, incredibly, reached a 98% metrics reduction by aggregating.

DoorDash was able to get a 41% optimization of data volumes. If you’re running DIY observability, one thing that can go awry is that, if you’re not federating your Prometheus clusters, it can take a really long time to get a response from your query. Robinhood was able to use our Query Accelerator to get an 8x improvement. I don’t know about you, but I would like my engineers who are on call to get the answers to the questions they have as quickly as possible.

Those are just a few use cases, just to highlight the impact of what we have going on. What does this whole end-to-end pipeline look like? Send us your traces, send us your metrics. We’ll hit the Chronosphere Collector. We’re supporting all the open source formats. Then come into the Chronosphere Control Plane, where you’re in control of your data to shape, transform, and manage it based on your organization’s use case.

Then, we will store that data for you. You’re charged for what is persisted, not everything that you send us over the wire. We do scale to 2 billion data points — that has been confirmed. From there, you can query, you can view things in dashboards, and of course, alert on this important data.

Why don’t we look at what some of this stuff looks like in real life? I’ll hand it back to Mofi to check out what Google’s got for Kubernetes visibility.

Continue the chat at minute 36:41 to hear the Q&A.


Ready to see it in action?

Request a demo for an in-depth walkthrough of the platform!