Logs, metrics, and traces – let’s review each type of telemetry, from their inherent characteristics to their trade-offs.
John is a Senior Sales Engineer at Chronosphere with nearly a decade of experience in the monitoring and observability space. John started as an engineer working on time-series data collection and analysis before moving to a pre-sales/customer-support role, and has worked with numerous companies across industries to solve their unique observability challenges.
Scott Kelly is a Sr. Product Marketing Manager at Chronosphere. Previously, he worked at VMware on the Tanzu Observability (Wavefront) team and led partner go-to-market strategies for VMware’s Tanzu portfolio with AWS and Microsoft Azure. Prior to VMware, Scott spent three years in product marketing at Dynatrace. Outside of work, Scott enjoys CrossFit, tackling home improvement projects, and spending time with his family in Naples, FL.
Published: Aug 7, 2024
Which is the best observability telemetry: logs, metrics, or traces? Depending on the observability vendor you ask, it’s often the one where they excel. Sure, there is some overlap. You can accomplish similar tasks with each, but using one type in all situations probably isn’t ideal.
This can be compared to a carpenter’s toolkit with different saws. The hand saw is versatile for general straight cuts in wood, but requires more effort and isn’t ideal for thicker materials. The circular saw, powerful and fast, handles large, straight cuts efficiently, but isn’t suited for precision work or curves. The jigsaw excels at cutting curves and intricate shapes across various materials, but is slower for long, straight cuts. Each saw is best for its intended purpose, just as logs, metrics, and traces each have unique strengths in observability.
You could cut a curve with a circular saw, but the edge would be choppy and uneven. You could also use a hand saw, but it would take longer and be less precise than a jigsaw.
So which one is the best? At the end of the day, there is no “right” answer. It really depends on what you’re willing to trade off (time, money, accuracy, etc.) in your specific situation. Each telemetry type has strengths and weaknesses, and the best choice depends on your unique needs and priorities.
With that idea in mind, let’s review each type of telemetry: logs, metrics, and traces. What are their inherent characteristics? What are the trade-offs?
Logs are arguably the simplest and most straightforward telemetry type to generate. Just write some information to a file or a stream, and voilà, you’re logging! Because we can write just about anything to a log (and doing so is easy, too!), logs can be used in a broad range of observability use cases. They can be consumed either by searching for specific records or by aggregating records that match some criteria for ad-hoc exploration and analysis. Structured logs with common fields are more easily filtered and aggregated, though logs can also be unstructured. That lack of inherent structure can make querying a challenge if teams don’t develop and adhere to a standard format for the logs they generate.
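To make the structured-versus-unstructured point concrete, here’s a minimal sketch of structured logging using only Python’s standard library. The logger name, the field names, and the JsonFormatter helper are illustrative assumptions, not something any particular tool prescribes:

```python
import json
import logging
import sys

# Emit each log record as one JSON object so fields like "route" and
# "status" can be filtered and aggregated downstream.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Merge any structured fields passed via the `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# High-cardinality fields like request_id are fine in logs: each record
# stands alone, so unique values don't inflate storage structures.
logger.info("request handled", extra={"fields": {
    "route": "/checkout", "status": 200, "request_id": "a1b2c3",
}})
```

Every record sharing the same field names is what makes later filtering (“all records where status >= 500”) cheap to express.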
From a cost perspective, logs become more expensive as the number and size of log records increase. Put simply, the cost to store and query/analyze logs scales directly with data volume, making logs more expensive for applications with high request rates, or those logging verbose details about the requests they handle. One notable benefit of logs is that their costs are generally not impacted by data cardinality: it’s perfectly reasonable to log fields like requestIDs that have unique values for each record. The caveat is that analyzing and grouping records by high-cardinality fields can be expensive, due to the large number of groups generated and returned when querying.
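As a rough illustration of that scaling (the request rate and record size below are hypothetical, not measurements from any real system):

```python
# Back-of-the-envelope log volume for one service.
requests_per_second = 2_000
bytes_per_record = 500            # one structured record per request

daily_bytes = requests_per_second * bytes_per_record * 86_400
print(f"{daily_bytes / 1e9:.1f} GB/day")  # 86.4 GB/day to ingest and store
```

Double the traffic or double the verbosity and the daily volume doubles with it, which is why both retention windows and per-record size matter for log costs.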
Given that logs are particularly easy to implement and tempting to use as a solution everywhere, it may be more effective to ask, “Why shouldn’t I use metrics or traces here?” rather than whether logs should be used. Logs excel in cases that benefit from their flexibility, such as recording complex information like stack traces, or performing ad-hoc analysis that may involve many different medium- to high-cardinality fields. Since log costs grow with data volume, logs are best used over shorter time windows to limit the cost of querying large volumes of log data. For long-term historical retention, logs are typically archived and must be “rehydrated” into active storage for ad-hoc analysis. For use cases that involve looking at specific historical information over an extended time window, metrics are generally more cost-effective.
Unlike logs, metrics have a much more specific intended use, and trade the general-purpose flexibility of logs for efficiency when applied appropriately. Instead of emitting many events to track activity in applications, metrics aggregate measurements and report them periodically. This aggregation is done as an optimization, to make the measurements we are interested in cheaper to store and query. As an example, instead of emitting thousands of events per second to count application requests, metrics emit a fixed number of periodic observations that record the count of requests handled over a window of time (say, once a minute). As a result, metrics have very different cost scaling behavior than logs, and can be orders of magnitude cheaper to store and query.
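Here’s a sketch of that counting pattern using the Prometheus Python client (the metric and label names are hypothetical). Note that each request only increments an in-memory counter; a scraper collects the aggregated totals periodically, so the telemetry cost stays flat no matter the traffic:

```python
from prometheus_client import Counter, start_http_server

# One aggregated counter instead of one log event per request.
# Label values are drawn from small, bounded sets.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["route", "status"],
)

def handle_request(route: str) -> None:
    # ... application logic here ...
    REQUESTS.labels(route=route, status="200").inc()  # O(1), in memory

if __name__ == "__main__":
    start_http_server(8000)   # expose aggregated totals on :8000/metrics
    for _ in range(10_000):   # 10k requests still produce just a handful
        handle_request("/checkout")  # of time series, scraped periodically
```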
As noted above, the optimization that metrics provide comes with a couple of trade-offs. First, metrics are designed primarily for tracking numerical measurements and are not well suited for recording general application and system information. Metrics do make it possible to encode metadata of interest as dimensions, and patterns exist for using metrics to capture general information about applications, but working with them this way is not as intuitive as doing the same with logs. Second, metrics can be fantastically cheap to store and process, but they have their own cost and scaling model, distinct from that of logs. Because metrics aggregate measurements over time across a given set of dimensions, their cost is determined by two factors:

- The number of unique time series being tracked, i.e. the metric’s cardinality: every distinct combination of dimension values is a separate series to store and query.
- How frequently each series reports a data point (its resolution), e.g. one observation per minute.
Data cardinality is particularly important because it can increase rapidly as we add dimensions. For example, if you add a new dimension with just two possible values to a metric, that metric’s cost will double, because there are now twice as many unique time series to track: every combination of your original dimensions is broken down further by the new dimension’s two values. If we don’t carefully manage the dimensions we track for our metrics, costs can quickly escalate, which is why we hear a lot about the cost of high-cardinality metrics in modern systems.
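A quick back-of-the-envelope calculation shows how multiplicative this is; the dimension names and value counts here are made up for illustration:

```python
from math import prod

# Series count is the product of each dimension's distinct value counts.
dimensions = {"route": 20, "status": 5, "region": 4}
print(prod(dimensions.values()))  # 20 * 5 * 4 = 400 time series

# Adding one two-valued dimension (say, canary=true/false) doubles it:
dimensions["canary"] = 2
print(prod(dimensions.values()))  # 800 time series to store and query
```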
Metrics are best leveraged in cases where their optimization benefits shine, providing fast and efficient access to key measures, even over longer windows of time, or when other data types are unsuitable. This may include use cases such as tracking KPIs used in alerts, as well as measures where the ability to view trends over a long period of time is desired. Metrics are less suited for flexible or ad-hoc exploration. For such cases, logs or traces are a better option, as adding many dimensions to metrics to support flexible analysis will increase their costs significantly.
Traces are arguably the most complex telemetry type among the “three pillars” to work with, though their structure bears a striking resemblance to uniformly structured logs. Traces show the relationships between connected operations in a request. Each event an application emits, called a “span,” carries information identifying the parent operation that called it, allowing related spans to be connected together. This connection of spans helps analyze how downstream operations impact upstream callers, providing a clear understanding of dependencies and performance bottlenecks.
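The parent/child structure is easiest to see in code. Below is a minimal sketch using the OpenTelemetry Python SDK, exporting to stdout so no backend is needed; the service and span names are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

# Minimal setup: print spans to stdout so the parent/child links
# are visible without a tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

# Each nested `start_as_current_span` records its parent's span ID,
# which is how a backend reassembles the request tree.
with tracer.start_as_current_span("handle_checkout"):
    with tracer.start_as_current_span("charge_card"):
        pass  # call out to the payment service here
    with tracer.start_as_current_span("update_inventory"):
        pass  # update stock levels here
```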
However, traces are more labor-intensive to implement, because their inherent value lies in connecting instrumentation across multiple services, unlike logs or metrics, which typically focus on a single service. Broken traces can also be problematic: individual spans may reference a parent that was never received, or a trace may “start” partway through a request due to problems with trace context propagation in the service that created it. These issues can be difficult to track down and fix, but left unresolved they seriously reduce the value traces provide.
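Context propagation is the part that usually breaks, so it’s worth seeing what it looks like. This sketch uses OpenTelemetry’s propagation helpers, assuming the SDK setup from the previous example; the header-passing transport and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("checkout-service")

# Client side: copy the current trace context into outgoing headers.
headers: dict[str, str] = {}
inject(headers)  # adds W3C `traceparent` (and `tracestate`) entries
# http_client.post("https://inventory.internal/reserve", headers=headers)

# Server side: resume the caller's trace instead of starting a new one.
def handle_request(incoming_headers: dict[str, str]) -> None:
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("reserve_inventory", context=ctx):
        pass  # spans created here link back to the caller's trace
```

If the `extract` step is skipped on the server side, the downstream spans start a fresh trace, which is exactly the “broken trace” symptom described above.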
In terms of cost, traces are again very similar to logs: costs scale with the number of spans emitted and the size of their tags and associated metadata. Organizations often sample traces to manage these costs as the volume of requests handled by instrumented services grows. Sampling reduces costs by storing fewer traces; typically, the goal is to omit “low value” traces, such as those tracking high-throughput/low-complexity requests, or “less interesting” behaviors like quickly completed successful requests.
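The simplest form is head-based probabilistic sampling, sketched below with the OpenTelemetry SDK (the 10% ratio is an arbitrary choice). Policies that keep only “interesting” traces, such as errors or slow requests, generally require tail sampling in a collector instead, since the decision has to wait until the trace is complete:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of traces. ParentBased makes each service follow the
# decision made at the root of the request, so sampling itself doesn't
# produce broken traces.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```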
Traces are best used for ad-hoc analysis, especially for understanding the relationships between dependent services or operations. They share many properties with logs, leading to significant overlap between the two. However, traces are generally preferable in highly interconnected systems, where the relationship information they capture is likely to be highly valuable.
When deciding how to instrument your systems and what kinds of telemetry data to rely on, it’s important to consider your overall observability strategy and the use cases you want to enable, and then leverage the different data types accordingly. Because they have different strengths and weaknesses, it’s often best to use a mix of metrics, traces, and logs, mapping each telemetry type to its benefits and using the others to cover its weaknesses. While the details can vary significantly and there really isn’t a “wrong” answer, here are some guidelines for determining which might be the right fit for a given use case:

- Reach for metrics when you need fast, cheap access to key measures: KPIs, alerting, and trends over long time windows.
- Reach for logs when you need flexibility: ad-hoc analysis, complex details like stack traces, and medium- to high-cardinality fields, over relatively short retention windows.
- Reach for traces when relationships matter: understanding how downstream operations affect upstream callers in interconnected systems.
What do these guidelines mean in practice? Let’s look at an example. When creating a new microservice, you’ll likely want to add instrumentation for a blend of logs, metrics, and traces to gain comprehensive visibility into different aspects of its behavior:

- Metrics to track request rates, error counts, and latency, and to drive alerts on those KPIs.
- Logs to capture detailed, high-cardinality context, such as stack traces and request parameters, for debugging individual failures.
- Traces to follow requests through the service and its dependencies, showing where time is spent end to end.

The sketch after this list ties the three together in a single request handler.
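This is only an illustration: the service name, metric names, and `charge` helper are hypothetical, and it assumes a tracer provider and JSON log formatter have been configured as in the earlier sketches:

```python
import logging

from opentelemetry import trace
from prometheus_client import Counter, Histogram

logger = logging.getLogger("payments")   # logs: verbose failure detail
tracer = trace.get_tracer("payments")    # traces: cross-service context

REQUESTS = Counter("payments_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("payments_request_seconds", "Request latency in seconds")

def handle_payment(request_id: str, amount_cents: int) -> None:
    # Trace: one span per operation, linked to the caller via context
    # propagation (see the earlier sketch).
    with tracer.start_as_current_span("handle_payment"):
        # Metric: cheap aggregates of every request -- rate, errors, latency.
        with LATENCY.time():
            try:
                charge(amount_cents)
                REQUESTS.labels(status="ok").inc()
            except Exception:
                REQUESTS.labels(status="error").inc()
                # Log: high-cardinality detail only for the failures you
                # actually need to inspect (stack trace included).
                logger.exception("payment failed", extra={"fields": {
                    "request_id": request_id,
                    "amount_cents": amount_cents,
                }})
                raise

def charge(amount_cents: int) -> None:
    """Placeholder for the real payment-provider call."""
```

The division of labor mirrors the guidelines: metrics answer “is the service healthy?” cheaply and continuously, logs answer “what exactly went wrong with this request?”, and traces answer “where in the chain of services did the time go?”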
So which is best? As you’ve read, no single type of telemetry is universally the best; each has its own strengths and weaknesses. Be wary of anyone telling you a specific telemetry type is superior. Different telemetry types can often achieve similar outcomes for the same use case, but there are always trade-offs involved. As with the saws in the opening analogy, each type excels in certain scenarios and falls short in others, so a balanced approach is needed to leverage their specific advantages.
Request a demo for an in-depth walkthrough of the platform!