Logs, metrics, and traces – let’s review each type of telemetry, from their inherent characteristics to their trade-offs.
John is a Senior Sales Engineer at Chronosphere with nearly a decade of experience in the monitoring and observability space. John started as an engineer working on time-series data collection and analysis before moving to a pre-sales/customer-support role, and has worked with numerous companies across industries to solve their unique observability challenges.
Scott Kelly is a Sr. Product Marketing Manager at Chronosphere. Previously, he worked at VMware on the Tanzu Observability (Wavefront) team and led partner go-to-market strategies for VMware’s Tanzu portfolio with AWS and Microsoft Azure. Prior to VMware, Scott spent three years in product marketing at Dynatrace. Outside of work, Scott enjoys CrossFit, tackling home improvement projects, and spending time with his family in Naples, FL.
Published: Aug 7, 2024
Which is the best observability telemetry: logs, metrics, or traces? Depending on the observability vendor you ask, it’s often the one where they excel. Sure, there is some overlap. You can accomplish similar tasks with each, but using one type in all situations probably isn’t ideal.
This can be compared to a carpenter’s toolkit with different saws. The hand saw is versatile for general straight cuts in wood, but requires more effort and isn’t ideal for thicker materials. The circular saw, powerful and fast, handles large, straight cuts efficiently, but isn’t suited for precision work or curves. The jigsaw excels at cutting curves and intricate shapes across various materials, but is slower for long, straight cuts. Each saw is best for its intended purpose, just as logs, metrics, and traces each have unique strengths in observability.
You could cut a curve with a circular saw, but the edge would be choppy and uneven. You could also use a hand saw, but it would take longer and be less precise than a jigsaw.
So which one is the best? At the end of the day, there is no “right” answer. It really depends on what you’re willing to trade off (time, money, accuracy, etc.) in your specific situation. Each telemetry type has strengths and weaknesses, and the best choice depends on your unique needs and priorities.
With that idea in mind, let’s review each type of telemetry: logs, metrics, and traces. What are their inherent characteristics? What are the trade-offs?
Logs are arguably the simplest and most straightforward telemetry type to generate. Just write some information to a file or a stream, and voilà, you’re logging! Because we can write just about anything to a log (and doing so is easy, too!), logs can be used in a broad range of observability use cases. They can be consumed either by searching for specific records or by aggregating records that match some criteria for ad-hoc exploration and analysis. Structured logs with common fields are more easily filtered and aggregated, though logs can also be unstructured. That lack of inherent structure can make querying a challenge if teams don’t develop and adhere to a standard format for the logs they generate.
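To make the structured-versus-unstructured point concrete, here’s a minimal sketch of structured logging using only Python’s standard library. The logger name, the field names, and the JsonFormatter helper are illustrative assumptions, not something any particular tool prescribes:

```python
import json
import logging
import sys

# Emit each log record as one JSON object so fields like "route" and
# "status" can be filtered and aggregated downstream.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Merge any structured fields passed via the `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# High-cardinality fields like request_id are fine in logs: each record
# stands alone, so unique values don't inflate storage structures.
logger.info("request handled", extra={"fields": {
    "route": "/checkout", "status": 200, "request_id": "a1b2c3",
}})
```

Every record sharing the same field names is what makes later filtering (“all records where status >= 500”) cheap to express.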
From a cost perspective, logs become more expensive as the number and size of log records increase. Put simply, the cost to store and query/analyze logs scales directly with data volume, making logs more expensive for applications with high request rates, or those logging verbose details about the requests they handle. One notable benefit of logs is that their costs are generally not impacted by data cardinality: it’s perfectly reasonable to log fields like requestIDs that have unique values for each record. The caveat is that analyzing and grouping records by high-cardinality fields can be expensive, due to the large number of groups generated and returned when querying.
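As a rough illustration of that scaling (the request rate and record size below are hypothetical, not measurements from any real system):

```python
# Back-of-the-envelope log volume for one service.
requests_per_second = 2_000
bytes_per_record = 500            # one structured record per request

daily_bytes = requests_per_second * bytes_per_record * 86_400
print(f"{daily_bytes / 1e9:.1f} GB/day")  # 86.4 GB/day to ingest and store
```

Double the traffic or double the verbosity and the daily volume doubles with it, which is why both retention windows and per-record size matter for log costs.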
Given that logs are particularly easy to implement and tempting to use as a solution everywhere, it may be more effective to ask, “Why shouldn’t I use metrics or traces here?” rather than whether logs should be used. Logs excel in cases that benefit from their flexibility, such as recording complex information like stack traces, or performing ad-hoc analysis that may involve many different medium- to high-cardinality fields. Since log costs grow with data volume, logs are best used over shorter time windows to limit the cost of querying large volumes of log data. For long-term historical retention, logs are typically archived and must be “rehydrated” into active storage for ad-hoc analysis. For use cases that involve looking at specific historical information over an extended time window, metrics are generally more cost-effective.
Unlike logs, metrics have a much more specific intended use, and trade the general-purpose flexibility of logs for efficiency when applied appropriately. Instead of emitting many events to track activity in applications, metrics aggregate measurements and report them periodically. This aggregation is done as an optimization, to make the measurements we are interested in cheaper to store and query. As an example, instead of emitting thousands of events per second to count application requests, metrics emit a fixed number of periodic observations that record the count of requests handled over a window of time (say, once a minute). As a result, metrics have very different cost scaling behavior than logs, and can be orders of magnitude cheaper to store and query.
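Here’s a sketch of that counting pattern using the Prometheus Python client (the metric and label names are hypothetical). Note that each request only increments an in-memory counter; a scraper collects the aggregated totals periodically, so the telemetry cost stays flat no matter the traffic:

```python
from prometheus_client import Counter, start_http_server

# One aggregated counter instead of one log event per request.
# Label values are drawn from small, bounded sets.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["route", "status"],
)

def handle_request(route: str) -> None:
    # ... application logic here ...
    REQUESTS.labels(route=route, status="200").inc()  # O(1), in memory

if __name__ == "__main__":
    start_http_server(8000)   # expose aggregated totals on :8000/metrics
    for _ in range(10_000):   # 10k requests still produce just a handful
        handle_request("/checkout")  # of time series, scraped periodically
```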
As noted above, the optimization that metrics provide comes with a couple of trade-offs. First, metrics are designed primarily for tracking numerical measurements and are not well suited for recording general application and system information. Metrics do make it possible to encode metadata of interest as dimensions, and patterns exist for using metrics to capture general information about applications, but working with them this way is not as intuitive as doing the same with logs. Second, metrics can be fantastically cheap to store and process, but they have their own cost and scaling model, distinct from that of logs. Because metrics aggregate measurements over time across a given set of dimensions, their cost is determined by two factors:

- The number of unique time series being tracked, i.e. the metric’s cardinality: every distinct combination of dimension values is a separate series to store and query.
- How frequently each series reports a data point (its resolution), e.g. one observation per minute.
Data cardinality is particularly important because it can increase rapidly as we add dimensions. For example, if you add a new dimension with just two possible values to a metric, that metric’s cost will double, because there are now twice as many unique time series to track: every combination of your original dimensions is broken down further by the new dimension’s two values. If we don’t carefully manage the dimensions we track for our metrics, costs can quickly escalate, which is why we hear a lot about the cost of high-cardinality metrics in modern systems.
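A quick back-of-the-envelope calculation shows how multiplicative this is; the dimension names and value counts here are made up for illustration:

```python
from math import prod

# Series count is the product of each dimension's distinct value counts.
dimensions = {"route": 20, "status": 5, "region": 4}
print(prod(dimensions.values()))  # 20 * 5 * 4 = 400 time series

# Adding one two-valued dimension (say, canary=true/false) doubles it:
dimensions["canary"] = 2
print(prod(dimensions.values()))  # 800 time series to store and query
```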
Metrics are best leveraged in cases where their optimization benefits shine, providing fast and efficient access to key measures, even over longer windows of time, or when other data types are unsuitable. This may include use cases such as tracking KPIs used in alerts, as well as measures where the ability to view trends over a long period of time is desired. Metrics are less suited for flexible or ad-hoc exploration. For such cases, logs or traces are a better option, as adding many dimensions to metrics to support flexible analysis will increase their costs significantly.
Traces are arguably the most complex telemetry type among the “three pillars” to work with, though their structure bears a striking resemblance to uniformly structured logs. Traces show the relationships between connected operations in a request. Each event an application emits, called a “span,” carries information identifying the parent operation that called it, allowing related spans to be connected together. This connection of spans helps analyze how downstream operations impact upstream callers, providing a clear understanding of dependencies and performance bottlenecks.
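The parent/child structure is easiest to see in code. Below is a minimal sketch using the OpenTelemetry Python SDK, exporting to stdout so no backend is needed; the service and span names are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

# Minimal setup: print spans to stdout so the parent/child links
# are visible without a tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

# Each nested `start_as_current_span` records its parent's span ID,
# which is how a backend reassembles the request tree.
with tracer.start_as_current_span("handle_checkout"):
    with tracer.start_as_current_span("charge_card"):
        pass  # call out to the payment service here
    with tracer.start_as_current_span("update_inventory"):
        pass  # update stock levels here
```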
However, traces are more labor-intensive to implement, because their inherent value lies in connecting instrumentation across multiple services, unlike logs or metrics, which typically focus on a single service. Broken traces can also be problematic: individual spans may reference a parent that was never received, or a trace may “start” partway through a request due to problems with trace context propagation in the service that created it. These issues can be difficult to track down and fix, but left unresolved they seriously reduce the value traces provide.
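Context propagation is the part that usually breaks, so it’s worth seeing what it looks like. This sketch uses OpenTelemetry’s propagation helpers, assuming the SDK setup from the previous example; the header-passing transport and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("checkout-service")

# Client side: copy the current trace context into outgoing headers.
headers: dict[str, str] = {}
inject(headers)  # adds W3C `traceparent` (and `tracestate`) entries
# http_client.post("https://inventory.internal/reserve", headers=headers)

# Server side: resume the caller's trace instead of starting a new one.
def handle_request(incoming_headers: dict[str, str]) -> None:
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("reserve_inventory", context=ctx):
        pass  # spans created here link back to the caller's trace
```

If the `extract` step is skipped on the server side, the downstream spans start a fresh trace, which is exactly the “broken trace” symptom described above.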
In terms of cost, traces are again very similar to logs: costs scale with the number of spans emitted and the size of their tags and associated metadata. Organizations often sample traces to manage these costs as the volume of requests handled by instrumented services grows. Sampling reduces costs by storing fewer traces; typically, the goal is to omit “low value” traces, such as those tracking high-throughput/low-complexity requests, or “less interesting” behaviors like quickly completed successful requests.
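The simplest form is head-based probabilistic sampling, sketched below with the OpenTelemetry SDK (the 10% ratio is an arbitrary choice). Policies that keep only “interesting” traces, such as errors or slow requests, generally require tail sampling in a collector instead, since the decision has to wait until the trace is complete:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of traces. ParentBased makes each service follow the
# decision made at the root of the request, so sampling itself doesn't
# produce broken traces.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```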
Traces are best used for ad-hoc analysis, especially for understanding the relationships between dependent services or operations. They share many properties with logs, leading to significant overlap between the two. However, traces are generally preferable in highly interconnected systems, where the relationship information they capture is likely to be highly valuable.
When deciding how to instrument your systems and what kinds of telemetry data to rely on, it’s important to consider your overall observability strategy and the use cases you want to enable, and then leverage the different data types accordingly. Because they have different strengths and weaknesses, it’s often best to use a mix of metrics, traces, and logs, mapping each telemetry type to its benefits and using the others to cover its weaknesses. While the details can vary significantly and there really isn’t a “wrong” answer, here are some guidelines for determining which might be the right fit for a given use case:

- Reach for metrics when you need fast, cheap access to key measures: KPIs, alerting, and trends over long time windows.
- Reach for logs when you need flexibility: ad-hoc analysis, complex details like stack traces, and medium- to high-cardinality fields, over relatively short retention windows.
- Reach for traces when relationships matter: understanding how downstream operations affect upstream callers in interconnected systems.
What do these guidelines mean in practice? Let’s look at an example. When creating a new microservice, you’ll likely want to add instrumentation for a blend of logs, metrics, and traces to gain comprehensive visibility into different aspects of its behavior:

- Metrics to track request rates, error counts, and latency, and to drive alerts on those KPIs.
- Logs to capture detailed, high-cardinality context, such as stack traces and request parameters, for debugging individual failures.
- Traces to follow requests through the service and its dependencies, showing where time is spent end to end.

The sketch after this list ties the three together in a single request handler.
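This is only an illustration: the service name, metric names, and `charge` helper are hypothetical, and it assumes a tracer provider and JSON log formatter have been configured as in the earlier sketches:

```python
import logging

from opentelemetry import trace
from prometheus_client import Counter, Histogram

logger = logging.getLogger("payments")   # logs: verbose failure detail
tracer = trace.get_tracer("payments")    # traces: cross-service context

REQUESTS = Counter("payments_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("payments_request_seconds", "Request latency in seconds")

def handle_payment(request_id: str, amount_cents: int) -> None:
    # Trace: one span per operation, linked to the caller via context
    # propagation (see the earlier sketch).
    with tracer.start_as_current_span("handle_payment"):
        # Metric: cheap aggregates of every request -- rate, errors, latency.
        with LATENCY.time():
            try:
                charge(amount_cents)
                REQUESTS.labels(status="ok").inc()
            except Exception:
                REQUESTS.labels(status="error").inc()
                # Log: high-cardinality detail only for the failures you
                # actually need to inspect (stack trace included).
                logger.exception("payment failed", extra={"fields": {
                    "request_id": request_id,
                    "amount_cents": amount_cents,
                }})
                raise

def charge(amount_cents: int) -> None:
    """Placeholder for the real payment-provider call."""
```

The division of labor mirrors the guidelines: metrics answer “is the service healthy?” cheaply and continuously, logs answer “what exactly went wrong with this request?”, and traces answer “where in the chain of services did the time go?”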
So which is best? As you’ve read, no single type of telemetry is universally the best; each has its own strengths and weaknesses. Be wary of anyone telling you a specific telemetry type is superior. Different telemetry types can often achieve similar outcomes for the same use case, but there are always trade-offs involved. As with the saws in the opening analogy, each type excels in certain scenarios and falls short in others, so a balanced approach is needed to leverage their specific advantages.
Request a demo for an in-depth walkthrough of the platform!