Prometheus is an open source, metrics-based monitoring solution that has quickly become the de facto standard for modern, cloud native organizations monitoring their systems. However, for companies that haven’t yet moved away from proprietary monitoring and application performance monitoring (APM) solutions, shifting applications to cloud native environments may feel a little daunting.
Chronosphere and DZone recently hosted a webinar on Five reasons to choose Prometheus when getting started with cloud native. Hear from Prometheus co-founder Julius Volz on Prometheus’ history and why it is the right fit for organizations that use cloud native observability, alongside Chronosphere’s Senior Product Marketing Manager Scott Kelly on Chronosphere’s compatibility with Prometheus, and how our technology works to support the way engineers need and want to work.
If you don’t have time to sit down for the full recording, read the transcript below.
What is Prometheus?
Julius Volz: My name is Julius, and I’m the co-founder of the Prometheus open source monitoring project, but also the sole person behind the company PromLabs.
With PromLabs, I mostly do trainings these days – both live trainings around Prometheus, and self-paced online courses. Today, I’m going to do something a bit unusual for me … Which is to talk about the five reasons why you might want to go with Prometheus when you’re looking at cloud native monitoring.
If you have not heard about Prometheus yet, let’s talk about what it actually is. Prometheus is a metrics-based monitoring and alerting stack that works particularly well with a dynamic cloud based environment — and it’s also open source. The history behind Prometheus is that we started it 10 years ago.
We were coming from Google to SoundCloud, and we kind of missed Google’s internal Borgmon monitoring system. We already found a very dynamic, cluster based environment at SoundCloud, and didn’t have anything proper to monitor it.
After some adoption at SoundCloud, we fully published it in 2015. In May of 2016, we joined the Cloud Native Computing Foundation (CNCF) as the second project in that foundation. Guess what the first project was? Kubernetes.
Just a couple of months later, we released the first stable, major version of Prometheus 1.0. It was still pretty early, so we still wanted to do a lot of breaking changes. Not too much later, we released Prometheus 2.0.
Since 2017, we’ve actually been able to not break any major API surfaces in Prometheus. We’re still on Version 2.X without breaking much. We’re still able to improve the stability, add features, and integrate with more and more things. We have Prometheus conferences, and we have started to standardize some of the interfaces in Prometheus. There is a lot of maturing going on.
Julius: Typically, you would run Prometheus as one or multiple servers in your infrastructure. That’s the middle box [in the image above], where you configure that Prometheus server to collect metrics over HTTP from a set of things that you want to monitor. We call these targets, and they could be services where you control the code and you can directly use a Prometheus instrumentation client library to expose the most beautiful metrics from directly inside of your code.
Or, it could also be some third-party code bases or even hardware devices where you cannot directly do that. For the model of running a little agent next to the thing you care about, we call that an exporter. That does a translation into this text-based Prometheus metrics format for you, so that Prometheus can scrape metrics from that device or third-party software.
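To give a feel for that text-based format, here is a toy sketch in Python of how an exporter might render metrics. This is not the official client library (which also handles escaping, metric registries, and serving over HTTP), and the metric names here are purely illustrative:

```python
# Toy sketch of the Prometheus text exposition format (illustrative only).
def render_metric(name, help_text, mtype, samples):
    """Render one metric family. samples: list of (labels_dict, value) tuples."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            # Label pairs are rendered as key="value", comma-separated.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

print(render_metric(
    "http_requests_total",
    "Total HTTP requests served.",
    "counter",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "POST", "status": "500"}, 3)],
))
```

A real exporter would serve a payload like this on an HTTP endpoint (conventionally `/metrics`) for Prometheus to scrape.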
Then, Prometheus collects and stores those metrics on a regular basis, building time series out of them over time. In the end it allows you to query, visualize, generate alerts, and do all kinds of other useful stuff based on the data you collected. One important thing here is that, for the cloud native world, Prometheus integrates very well with all kinds of different types of service discovery in your environment.
For example, it can talk to a Kubernetes API server — DNS, Consul, or other sources of truth on an ongoing basis, to discover these targets on the left and know all the things that it currently should be monitoring.
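As a concrete illustration, a minimal scrape job in `prometheus.yml` using Kubernetes service discovery might look like this sketch (the job name is illustrative):

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"   # illustrative name
    kubernetes_sd_configs:
      - role: pod                 # other roles include node, service, endpoints, ingress
```

With this in place, Prometheus keeps asking the API server for the current set of pods, so targets come and go without config reloads.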
Reason #1: Large and mature ecosystem
Julius: Let’s go into the first reason why you might want to use Prometheus. By now, Prometheus is very mature, and the ecosystem is also mature and large. We started ten years ago, and it’s been open source, or really fully public, since 2015. We have been at a pretty stable version of 2.X for a long time. You see startups, big enterprises, and banks using Prometheus for their mission critical monitoring. You have Walmart and Disney, and all kinds of other big players using it.
Since we have a huge community, there are many people helping us integrate with anything under the sun. We have almost 1,000 exporters that we know about alone, but there are probably way more. These are all the little agents that help you get metrics out of the different things that you care about. There’s also a lot of software out there these days that directly exposes Prometheus metrics about itself, so you don’t even need an exporter anymore. Also, for any major language, you will usually find a client library that allows you to expose metrics from your own code.
Reason #2: Prometheus as the de facto standard monitoring system
Julius: Let’s talk about Prometheus being the de facto standard in open source metrics-based monitoring, and what that means also for portability. GitHub Stars is a little bit of a vanity metric, but as an open source project, we don’t have really great metrics about how many real users there are, and how many installs there are because we don’t send home telemetry.
This is a fun proxy metric, that at least shows some level of popularity of the project. I think it’s really fair to say that Prometheus has become the de facto standard for open source metrics-based monitoring.
If you compare [the image above] to Kubernetes, Kubernetes is one of the largest open source projects in the world. We have almost half of the number of Stars, just to give you a sense of scale.
By now, almost everyone wants to integrate with Prometheus in one way or another. On one hand, you have the large cloud vendors: AWS, Google Cloud, and Azure basically have managed Prometheus offerings where you can send Prometheus data and they do Prometheus compatible querying for you, or at least they claim to, and some actually do.
You also have Chronosphere on here — they also give you a service where you can send your Prometheus data and have alerting, dashboarding, and everything based on your Prometheus data, so that you don’t have to run it yourself.
All these kinds of integrations are happening and there’s a question: How are these different players integrating? In the past, when we started out with Prometheus, there was a lot of focus on the actual boxes, or specific implementations from the open source project. You had the Prometheus server pulling some data from a target. It’s potentially forwarding them to some remote storage system. Maybe it’s being queried, or it’s sending out alerts. But nowadays, since everyone wants to interoperate, the focus is much more on these arrows in between the boxes.
How do you get metrics out of a Prometheus server? We have the remote write protocol. That is how Prometheus can forward already collected data to some potentially more scalable or durable storage backend.
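In `prometheus.yml`, forwarding via remote write is a short sketch like this (the endpoint URL is a placeholder for whatever storage backend you use):

```yaml
remote_write:
  - url: "https://metrics-backend.example.com/api/v1/push"  # placeholder URL
```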
The PromQL query language is a very major interface; it is basically the big interface for doing anything useful with the collected data. Then there is also the alerting protocol, [with] many different players. These are the interfaces that are relatively PromQL specific, allowing you to query back the data that you send in an almost or fully PromQL compatible way.
This is really great in general. If we have all these open standards for these interfaces, then that allows users to choose between different trade offs, so you don’t necessarily have to go with just the vanilla, open source upstream implementations. You might go with some cloud based solution, or a different open source based one that is maybe more horizontally scalable and so on. This leads to a larger ecosystem with better competition — which is hopefully better for everyone.
Prometheus conformance program
Julius: Of course, there’s one big caveat – which is compatibility. Once everyone starts implementing similar interfaces, you hope that they’re not just similar, but actually compatible. To ensure that more and more, we as the Prometheus open source project are starting the Prometheus conformance program, with both a technical part and a legal part. Both are still a work in progress, but progressing fast. The technical part is a repo where vendors can self-certify that they are compatible with certain interfaces.
Then, the legal part is still being defined together with the Linux Foundation: a cloud vendor who claims to be Prometheus compatible can create a contract with the Linux Foundation, and after self-certifying, they’re able to use certain compatibility marks in their copy.
Reason #3: Open source and open governance
Julius: Prometheus is open source, which first of all means it is free to start using. Of course, it’s not free to run: you still have to invest time and engineering effort, but you don’t have to pay license fees. It’s not a black box, which means:
A) You can look inside [the code].
B) You can also modify it to your needs if you need to.
There’s not one single company behind it — it survives single companies going bankrupt, or going out of business for some other reason. Besides that, us being part of the CNCF means that we have a neutral home. The CNCF owns the official assets, like the domain, trademark and some other stuff, and it helps us in different ways with events and lots of support. It also means that it’s kind of a “Switzerland,” where all kinds of different companies and stakeholders can come together and all work on Prometheus together under open governance. It just means that you can also come in and start contributing, or maybe you will become a team member after a while, then you can start voting on the project direction. You can have an actual voice as well.
Finally, there’s the open source and talent pool aspect. Engineers really love working on open source software. That means you can find a lot of people who would love working on open source Prometheus; and because it is already the de facto standard in that area, you will also find a lot of people who already know how to use it.
Reason #4: PromQL as a powerful query language
Julius: I think this is one of the biggest technical selling points of Prometheus — the query language that we have: PromQL. Basically, it’s a read-only, functional, query language that is specifically built for doing cool stuff with metrics: Selecting metrics, transforming them and aggregating them, but also doing all kinds of vector arithmetic math between sets of metrics.
The idea here is really that you collect all the data as metrics — in this label based dimensional format in Prometheus first — and then there’s this one unified, query language that allows you to cover all kinds of different use cases. So, not only alerting, but also dashboarding in one system with the same language — and ad hoc debugging as well, or capacity planning, automation, or data science. Any other kind of things you can think of that fit into this metrics-based model can usually go through PromQL as a layer.
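To make that concrete, here are two small PromQL examples; the metric and label names are hypothetical (`http_requests_total` is just the conventional example counter):

```promql
# Per-path request rate over the last 5 minutes, summed across instances
sum by (path) (rate(http_requests_total[5m]))

# Error ratio: vector arithmetic between two sets of series
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```

The same expressions can drive a dashboard panel, an alerting rule, or an ad hoc debugging query, which is exactly the one-language-many-use-cases point.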
Continue this portion of the conversation at minute 21:32.
Reason #5: Prometheus’ out-of-the-box functionality with Kubernetes
Julius: Prometheus actually works really well with Kubernetes out of the box. That is no accident, because at Google there was Borg, and Borgmon was the monitoring system to go along with it. Both Borg, the cluster itself, and everything running on Borg were monitored with Borgmon: Gmail, Search, and all these services. Then both of these internal systems got open source implementations, or at least ones that were inspired by the internal counterparts.
So, Borg inspired Kubernetes – which was done directly by Google itself. And at roughly the same time, but without any coordination, at SoundCloud we created Prometheus, inspired by Borgmon. There’s a good reason why these systems interact pretty nicely with each other. When we fully published Prometheus, that was also around the time Kubernetes started to become big, and they pretty quickly added support for Prometheus as well. So, on the one hand, you have support in Kubernetes for Prometheus: all the different cluster components and servers that are part of Kubernetes serve native Prometheus metrics endpoints, and that is the primary format in which Kubernetes exposes metrics. You have the Kubernetes API server, you have the kubelet on each host that runs and manages the containers, both exposing Prometheus metrics; you have etcd exposing Prometheus metrics, and other cluster components as well.
Julius: So, it’s really easy with Prometheus to monitor the cluster itself. Prometheus also supports monitoring stuff on Kubernetes really easily. That happens because Prometheus is able to talk to the Kubernetes API server in an ongoing, continuous way for service discovery, to discover your service endpoints or pods or ingresses — and all kinds of other cluster objects — to not only find out that they exist, but also what they are. You get a lot of rich metadata labels that can then be inherited into the scraped time series of those services, pods, and so on.
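For example, a relabeling rule can copy a Kubernetes pod label onto every time series scraped from that pod. This is only a sketch: the `app` label name is an assumption about how your pods happen to be labeled.

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"   # illustrative name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Service discovery exposes the pod's "app" label as
      # __meta_kubernetes_pod_label_app; copy it onto all scraped series.
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
```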
Monitoring works pretty much out of the box, in both directions, in a Kubernetes-and-Prometheus scenario. In almost every Kubernetes cluster, you will also find some kind of Prometheus stuff somewhere. I think it’s really the best way to do metrics-based monitoring in a Kubernetes cluster, because you don’t need an additional translation layer that uses a slightly different data model, which would leave you with less clean metrics than in the original form. In a Kubernetes scenario, it’s really nice to use Prometheus.
Going forward with Prometheus
Julius: Just to summarize [why to use Prometheus for cloud native environments]:
1.) We have a large and mature ecosystem, which helps you really have some trust in Prometheus.
2.) We have the portability and standard aspect.
3.) We have the open source aspect that allows you to both inspect, change and also have some influence in the system if you want to.
4.) The query language really is just a technical enabler for all kinds of different use cases, from alerting to dashboarding.
5.) Kubernetes just works very well out-of-the-box [with Prometheus].
Resources to get started with Prometheus
Julius: So, I have some resources for you to get started as well. On one hand, there is the official open source project website, Prometheus.io. It has the full documentation of all features that exist, but it’s maybe not always the best place to fully learn everything from scratch in a structured way, because it’s more of a reference and not a kind of training. As part of what I’m doing, I’m creating these self-paced training courses as well.
While they are paid, some of them are free as well. Hopefully they’re helpful. If you are a Chronosphere customer, you actually get these for free.
See Prometheus in action
Julius: The first part [of the demo] is just showing how easy it is to download Prometheus and monitor the local host metrics. Then, for the second part, I’m just going to show Prometheus already running in a single node Kubernetes environment, a little bit more complex, just to show you the kind of metrics you can automatically and easily get there.
Continue this portion of the chat at minute 32:23.
An overview of Chronosphere
Scott Kelly: At Chronosphere, we’re very supportive of people getting started with Prometheus when they’re moving to cloud native. It’s a great solution, and it works out of the box. But when you get to a certain scale, you can start to see some issues. We have some great assets that go into more depth about what those are, but there are a few signs that you might start to see as you run Prometheus at scale. We’ll walk through these pretty quickly just to give you an idea.
First of all, one of the issues with cloud native environments is that they create a lot of data, like 10 to 100x more than their virtual machine counterparts. As the data starts to grow, issues like these start to become a problem. For example:
- Not being able to find data quickly
- Engineers having to go look for data
- Sifting through data
- Trying to find the right dashboards, which can impede troubleshooting efforts
The other issue is data retention. Some of the things that you have to do within Prometheus to retain data for longer can be cumbersome, and that can cause challenges. You may also see decreasing reliability: as you add more data into the system, it may be more difficult for alerts to fire on time and for dashboards to load. That can impact your ability to keep your systems running reliably.
One of the other things is the cost versus collections trade off. As you grow and you’re collecting more data, you may get to a point where it’s very expensive, and now you’re making decisions about which data to collect, and which data to keep. When you start to have to do that, you start to potentially negatively impact your ability to remediate issues and run Prometheus effectively.
Scott: Chronosphere is a single tenant SaaS cloud native observability platform. We ingest metrics and traces via open standards to enable rapid remediation. It’s incredibly reliable, and we actually offer one of the highest [service-level agreements] (SLAs) in the industry.
You can see that our Collector allows you to collect metrics and trace data in your environment, and we support all standard open source formats. For Prometheus, it supports all standard discovery and configuration options that you’d be using. And, it uses a lot of the Prometheus open source code base, so it works really well.
The Control Plane
Next is our Control Plane, and this is incredibly unique in the market. It lets you fulfill your existing dashboard and alerting needs without having to store all of the observability data in raw form. What we do is pre-process it and optimize the data. The result is a reduced cost and improved performance.
How do we do that? First, we analyze: We allow you to analyze and understand the value of your observability data [so you know] what’s valuable, what’s junk, what can you get, [and] what do you not need?
Quotas to control cardinality spikes
Next, we give you tools to shape and transform data based on need, context, and utility. You control these costs by optimizing your data sets, and it gives teams the right data to solve problems faster.
We also allow you to delegate responsibility to control cardinality and growth — we allow you to set quotas where teams will now be responsible for managing their own metric data. This can control overages, and things such as cardinality spikes that can cause havoc with the system, and lead to cost overruns and impede other uses of the system.
The Query Accelerator
Lastly, we provide tools to allow you to continuously optimize the platform. We have what’s called the Query Accelerator: if anything changes within the platform that causes a query to slow down, we can actually help optimize it again so everything runs fast, dashboards load fast, and alerts fire on time. Then we have the Chronosphere data store. We have the most efficient way of collecting and storing data at scale. We learned how to do industrial, high performing, and highly available storage from building M3 at Uber, where our founders came from. This database has proven to scale to 2 billion data points per second, or 20 billion active time series. It’s pretty scalable.
The Query Engine
Next is our Query Engine — it’s 100% PromQL compatible. As you saw, Julius was showing you the power of that. He talked about the query builder that he built and that we worked on together to integrate into the product. It’s very easy to use, very powerful, and 100% compatible. If you’re familiar with the PromQL language, you can use our solution out of the box.
The benefits of using Chronosphere
Scott: First of all, we help reduce observability data volumes by 50% or more. We do that by the controlling and shaping tools that we provide within the platform in the control plane. This helps you reduce your initial data volumes, but also control long term growth. And then, we are 100% open source compatible. You have no vendor lock in.
We mentioned that the Collector is open source compatible. The database is also open source compatible, and our query language is 100% PromQL compatible. All the ins and outs of the platform are open source compatible, so you don’t have to worry about vendor lock in. We’re also one of the most reliable platforms out there: our SLA is three 9’s of uptime, and we have actually been delivering closer to four 9’s. Very high reliability.
Lastly, on average, our customers spend 50% less time troubleshooting. Because of the shaping capabilities, speed, and the reliability of our platform, engineers get the information they need to quickly find and fix problems. When we talk to our customers and measure, they’re finding about a 50% reduction in the time they spend troubleshooting.
Continue the rest of the webinar at minute 46:35.