This is a guest article written by Prometheus co-founder Julius Volz of PromLabs in partnership with Chronosphere. Prometheus is an open source project hosted by the Cloud Native Computing Foundation (CNCF) under open governance. PromLabs is an independent company created by Julius Volz with a focus on Prometheus training and other Prometheus-related services.
On: Feb 22, 2024
Whether you are an online retailer, a physical goods manufacturer, a restaurant chain, or any kind of larger business, complex IT systems are at the core of everything these days. Organizations increasingly depend on them for their ability to operate on a day-to-day basis, whether they run their IT on-premises or in the cloud.
Unfortunately, IT systems of even moderate complexity are prone to various outages and failures. In the best case, a failure will cause a minor inconvenience, while in the worst case it may cause a costly and prolonged business disruption or even put your entire company in jeopardy.
While it’s unrealistic to prevent all failures occurring in your IT infrastructure, you can work to detect as many problems as possible before they become a larger issue. This is the goal of systems monitoring, or observability. Paired with redundant and fault-tolerant system design, systems monitoring lets you find and fix issues before the users of the system even notice a degradation.
In this article, we look at the most popular ways to gain insight into your IT systems, especially when it comes to detecting bad system behavior and taking corrective action. We will cover the pros and cons of each approach, and then finally highlight Prometheus as one of the most popular monitoring systems for the purpose of making sure that your business continues to run smoothly.
Before you start to measure system behaviors, it is a good exercise to ask yourself which behaviors you actually care about in a system, and which deviations from the desired behaviors mean trouble. While the details depend on the system in question, users of a system will generally care about aspects such as:
Ensuring these properties helps you maintain a fast and reliable service to your users while making the best use of your resources.
The reality is that modern IT systems are becoming increasingly complex. Especially with the proliferation of cloud services, cluster managers like Kubernetes, and microservice architectures, you often end up with distributed systems consisting of a plethora of layered hardware and software components that all need to function and work together correctly in order for your business to operate.
Given the scale and complexity of modern infrastructures, it is actually rare for most large IT systems to run without any problems whatsoever – most of them are in a constant state of partial degradation and brokenness, even if this is not always immediately noticeable to users.
Here are a couple of examples to illustrate the variety of things that can go wrong in systems and services:
There are many more things that can go wrong, but the above are a few examples to give you an idea of the kinds of issues you may encounter. If you want your business to run smoothly, you will want to closely monitor the behavior of your systems to detect these problems as quickly as possible so you can take corrective action before their impact becomes visible to users.
With that in mind, what are some good system properties to monitor that help us detect these problems? While the exact properties will depend on the system you are monitoring, there are a few common characteristics that matter across a wide range of systems. They include:
Whether it’s temperature, humidity, voltage, the current system time, or other metrics, there are many more aspects that you may want to measure to ensure that your system is behaving correctly. Let’s see which approaches exist to allow us to measure these system properties and react to them.
There are a variety of ways to measure your systems’ behaviors and get insight into them. Each comes with different tradeoffs and suits different use cases. In many cases, an organization will therefore not settle on a single approach but will use several in parallel. Let’s have a look at three of the most popular signal types that you can record, and how they work: event logging, tracing, and metrics.
In event logging, a process emits a timestamped record of every individual request or event happening inside the process. The record will often include structured fields covering a wide range of details related to the event. For example, a logged HTTP request may include the client ID, the HTTP method, path, and response status code, as fields:
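A structured record of this kind could look like the following minimal sketch. The field names, the sample values, and the choice of JSON as the encoding are illustrative assumptions, not a format prescribed by any particular logging system.

```python
import json
from datetime import datetime, timezone


def make_log_record(client_id: str, method: str, path: str, status: int) -> str:
    """Serialize one HTTP request as a timestamped, structured JSON log line."""
    record = {
        # Every event carries its own timestamp.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Structured fields describing this individual request (hypothetical names).
        "client_id": client_id,
        "method": method,
        "path": path,
        "status": status,
    }
    return json.dumps(record)


# One log line is emitted per handled request.
line = make_log_record("client-42", "GET", "/api/orders", 200)
print(line)
```

Because every request produces its own record, the log volume grows in step with traffic.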
Event logging gives you very detailed insight into your request-serving systems, allowing you to inspect fine-grained details of each individual processed event. However, logging also comes with significant drawbacks:
Due to these limitations, event logs usually only give us partial visibility into a system, and querying large log volumes to determine the overall health of a system can become prohibitively expensive.
Traditional logging is also limited in its ability to help you find and correlate multiple related events, such as a database query that is executed as part of an overall user request to your frontend service. This is where traces come in.
Traces help you understand the full path of a user request through a set of layered systems. For example, a trace can show the path of a request all the way from a load balancer through the frontend service, and then to the backend database.
When a request hits the first component of a system (in the previous example, the load balancer), it is assigned a unique trace ID (UID) that is propagated from one component to the next along the entire path that the request takes. Each component then records so-called spans that track a given amount of work performed as part of the overall request handling. Each span has a start and end time and includes the request’s trace ID. Traces also record hierarchical relationships between spans when the work related to one span is executed in the context of a higher-level parent span.
The resulting traces are then sent to a centralized system that stores and makes them available for querying. Since spans related to a single request share the same trace ID, they can be efficiently looked up and presented together in a correlated way that can help you debug in detail what happened to an individual request.
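The mechanics described above can be sketched in a few lines. This is a simplified illustration rather than a real tracing library: the `Span` class, its field names, and the in-memory `store` standing in for the centralized system are all assumptions made for the example.

```python
import time
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """A unit of work that belongs to one trace (hypothetical structure)."""
    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self) -> None:
        """Record the end time of this span."""
        self.end = time.time()

    def child(self, name: str) -> "Span":
        """Start a child span that inherits the trace ID and points at its parent."""
        return Span(trace_id=self.trace_id, name=name, parent_id=self.span_id)


# The first component (here: the load balancer) assigns the trace ID.
root = Span(trace_id=uuid.uuid4().hex, name="load-balancer")
frontend = root.child("frontend")
db = frontend.child("database-query")

# Inner spans finish before the spans that enclose them.
for span in (db, frontend, root):
    span.finish()

# Stand-in for the centralized trace store: spans indexed by trace ID.
store = defaultdict(list)
for span in (root, frontend, db):
    store[span.trace_id].append(span)

# All spans of one request can be looked up together via the shared trace ID.
for span in store[root.trace_id]:
    print(span.name, "parent:", span.parent_id)
```

In a real deployment the trace ID would travel between processes, for example in request headers, instead of being passed around inside a single program.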
While tracing is the method of choice for understanding and debugging the path of individual user requests through a system, it also comes with a set of drawbacks that are similar to those of the logging approach:
As a result, traces are usually not the primary way of monitoring the health of a system, but are often used in addition to other methods to understand and debug the flow of individual user requests.
Metrics (or time series) are numeric values that are sampled and stored at regular intervals over time. For example, you may want to record various Kubernetes container metrics:
In contrast to logs and traces, metrics don’t allow you to track details for individual events, but they still allow you to record an aggregate value of things that happened, like the total number of HTTP requests handled by a given server process, optionally split into multiple time series along dimensions such as the HTTP path, method, or status code.
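To make the idea of an aggregate value split along dimensions concrete, here is a minimal sketch using only the Python standard library. The metric name `http_requests_total`, the helper `observe_request`, and the sample requests are hypothetical; a real metrics library would also handle sampling and storage over time.

```python
from collections import Counter

# Hypothetical metric: total HTTP requests, one count per (path, method, status).
http_requests_total = Counter()


def observe_request(path: str, method: str, status: int) -> None:
    """Increment the aggregate count; no per-request details are kept."""
    http_requests_total[(path, method, status)] += 1


for path, method, status in [
    ("/api/orders", "GET", 200),
    ("/api/orders", "GET", 200),
    ("/api/orders", "POST", 500),
    ("/healthz", "GET", 200),
]:
    observe_request(path, method, status)

# Aggregate across all dimensions, or filter along one of them.
total = sum(http_requests_total.values())
errors = sum(
    count
    for (_, _, status), count in http_requests_total.items()
    if status >= 500
)
print(total, errors)  # prints "4 1"
```

Four requests were handled, but only three counters exist: memory use depends on the number of distinct label combinations, not on the number of requests.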
While metrics don’t allow you to gain as much per-request insight, they come with some benefits as compared to logs and traces:
As a result, most organizations base their main systems monitoring strategy on a metrics-based system, while still capturing logs and traces for specific business areas and use cases.
When it comes to metrics, the Prometheus monitoring system has evolved to become the de facto standard open source tool of choice across the IT industry. The Prometheus project provides a set of libraries and server components for tracking and exposing metrics data from systems and services, for collecting and storing those metrics, and for making them accessible and useful for a variety of use cases.
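Services typically expose their metrics to Prometheus over HTTP in a simple text-based format. The sketch below renders a counter in that style. The function `render_counter` and the sample data are hypothetical and written for illustration only: the official client libraries generate this output for you, and this simplified version skips details such as escaping special characters in label values.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the style of the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a count.
    Simplified: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data: request counts split by method and status.
samples = {
    (("method", "GET"), ("status", "200")): 2,
    (("method", "POST"), ("status", "500")): 1,
}
text = render_counter("http_requests_total", "Total HTTP requests handled.", samples)
print(text, end="")
```

Each output line after the two comment lines represents one time series: a metric name, a set of labels, and the current value.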
Here are some of the main features that make Prometheus so popular:
With the omnipresence of complex IT systems and their proneness to failure, you need to be able to get insight into the state of your systems and react to faults before they can disrupt your business. In this article, we examined three popular approaches to making sense of systems and services (logs, traces, and metrics) and how each of them can track system health.
While each of these signal types comes with different tradeoffs and suitable use cases, metrics usually form the backbone of an organization’s monitoring strategy. This is due to their relatively low cost and their broad applicability to recording a multitude of different system health indicators.
When it comes to metrics-based monitoring, many organizations across the industry have adopted Prometheus as their primary system for mission-critical monitoring and alerting, and a large community of developers and users are contributing integrations.
This widespread adoption has made it much easier for organizations to integrate monitoring into their technology stack, ensure overall system health, and get the reliability, availability, and stability they want from their systems.