An introduction to distributed tracing
Cloud native has revolutionized application development in ways that are both positive and challenging. Adoption of microservices architectures on container-based infrastructures enables faster software development lifecycles. At the same time, problems can strike when changes are made to apps, such as adding new features. Moreover, app updates can happen multiple times a day. So how do teams track down problems when error messages pop up, or when it suddenly takes longer to load an application?
Unlike the monolithic approach to application development, where a straightforward application call makes it easy to find where a problem exists, cloud native applications and the container-based infrastructure they run on are ephemeral and highly distributed. This means problems are elusive. The need for distributed tracing, which tells you exactly where a problem is happening, becomes acutely important for teams needing to quickly fix their applications.
Distributed tracing:
- Informs development teams about the health and status of deployed application systems and microservices
- Identifies irregular behavior that results from scaling automation
- Shows how latency or errors in a given service affect overall end-user experience for requests
- Highlights whether errors or slowness for a service’s requests are due to an issue in that service, or a downstream service it depends on
- Allows teams to look at the health of their services across dimensions that would be too expensive to capture as metrics
How does distributed tracing benefit my organization?
Distributed tracing can benefit your organization by saving you time and money, as well as strengthening your internal team collaboration.
The four primary benefits of distributed tracing are:
- Improving problem identification and understanding. Traces act as an early warning system that lets teams know there is a problem. They also make it easier for newer team members to quickly understand what’s going on, rather than having to call in a power user when something goes wrong.
- Reducing troubleshooting time. Distributed tracing provides insights that allow development teams to recognize poor system health and identify where bottlenecks exist in the software stack. This works in conjunction with the early warning capability described above. Teams are provided with the data they need to restore services, deliver a positive end-user experience, and adhere to the organization’s service-level agreements.
- Strengthening collaboration. Because distributed tracing enables teams to gain visibility into the complex interactions between microservices, it fosters a shared understanding of system performance and issues. This transparency encourages cross-functional collaboration, as separate engineering teams can collectively diagnose and resolve problems, leading to a more cohesive and efficient problem-solving process.
- Preserving your bottom line. Finally, distributed tracing equips organizations with the insights needed to identify and address system bottlenecks or failures promptly, minimizing downtime. By ensuring the smooth operation of services, it helps avoid the revenue loss associated with system outages, directly safeguarding the company’s bottom line.
The different types of tracing
Distributed tracing is just one type of tracing available to teams that want a better understanding of their systems. Depending on the organization or software team you are on, you may use several different types of tracing, all of which work in the same general fashion but apply to different use cases or scopes within a system.
Different types of tracing include:
Code tracing: This is a granular approach that focuses on the execution flow within the software itself. It involves instrumenting the codebase to log detailed information about the execution paths that are taken, function calls, and the time spent in each segment of the code. This type of tracing is particularly useful for developers as it helps in debugging and optimizing the performance of individual services or functions within a distributed system. By providing a microscopic view of the code’s behavior, code tracing enables developers to pinpoint inefficiencies and errors at the source level.
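As a rough illustration, here is a minimal sketch of what code tracing can look like in practice: a decorator that logs entry, exit, and elapsed time for each call. The function and logger names are purely illustrative and not tied to any particular tracing library.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("code-trace")

def traced(func):
    """Log entry, exit, and elapsed time for a single function call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("enter %s", func.__name__)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("exit %s after %.2f ms", func.__name__, elapsed_ms)
    return wrapper

@traced
def lookup_inventory(flavor: str) -> int:
    time.sleep(0.05)   # stand-in for real work
    return 12          # hypothetical stock count

lookup_inventory("red velvet")
```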
Program tracing: This type of tracing is associated with APM (application performance monitoring) tools and extends beyond individual code segments to cover the execution of entire programs or applications. Program tracing captures the flow of execution across different modules within an application, as well as calls to dependent services, tracking where time is spent handling a request for a given application. Program tracing is instrumental in understanding the behavior of complex applications, especially when they have multiple dependencies such as databases or additional services. It helps in identifying how data flows between various modules, and potential bottlenecks at the program level.
End-to-end tracing: Another term for distributed tracing, end-to-end tracing allows us to follow a request across all of the services within a system and understand its behavior from end-to-end. This type of tracing aggregates the data collected from various points in the system to create a comprehensive picture of the transaction’s path, including all the microservices it interacts with. End-to-end tracing is invaluable for operations teams and system architects as it helps ensure that the system performs optimally from the perspective of the end user. It allows for the identification of latency issues, service dependencies, and the impact of each component on the overall system performance, facilitating a better understanding of a request’s behavior and improving system reliability.
Having walked through the three different types of tracing, we can see that they’re not mutually exclusive: vendor solutions commonly offer a blend of these capabilities. Still, the differences can be hard to see at a glance.
For more on the differences between the types of tracing and how the lines are being blurred, give the blog “Distributed tracing vs. APM: What’s the difference?” a read.
How does distributed tracing work in a microservices architecture?
Distributed tracing makes it possible to see exactly where in a system work is happening. It works by capturing individual units of work, known as spans, in a distributed system, with each span containing a reference to the ID of the overall request (the trace ID), as well as a reference to the ID of the span that called it (its parent span).
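To make that concrete, here is a minimal sketch of the fields a span typically carries. The field names and values are illustrative rather than any specific tracing system’s wire format; the relationships are the important part: every span in a request shares one trace ID, and each span points at its parent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str                   # shared by every span in the same request
    span_id: str                    # unique to this unit of work
    parent_span_id: Optional[str]   # None for the root span of the request
    name: str
    start_ns: int                   # start time, nanoseconds
    end_ns: int                     # end time, nanoseconds

# One request ("is red velvet in stock?") captured as three related spans:
root = Span("trace-abc", "span-1", None,     "GET /inventory/search",    0,          180_000_000)
svc  = Span("trace-abc", "span-2", "span-1", "inventory-service.lookup", 20_000_000, 150_000_000)
db   = Span("trace-abc", "span-3", "span-2", "SELECT stock FROM items",  40_000_000, 120_000_000)
```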
This structure makes it easy to analyze and view distributed traces from the perspective of the individual units of work as well as the overarching request. A great example of distributed tracing is a workflow request, which is a series of activities that are necessary to complete a task. We see workflow requests in everyday activities, like ordering our favorite cupcakes online. The example below shows how this works:
Let’s say Nichelle and Robin each want to know if red velvet cupcakes are in stock at their local bakery. Nichelle and Robin would get on their respective mobile phones, open the bakery application, and search for “red velvet.”
- When Nichelle and Robin initiate their searches for red velvet cupcakes, each triggers a workflow request to get information about inventory
- These workflow requests are then processed through application services
- Information is returned to their respective mobile apps
Keep in mind that Nichelle’s and Robin’s workflow requests were the same: each went through the same application, used the same services, and asked for the same type of cupcake. However, the metadata associated with each request, such as tags, performance measurements, or other descriptors, may differ. While workflow requests may be the same for multiple users, the associated metadata is unique.
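As a sketch of what that instrumentation might look like, the example below uses the OpenTelemetry Python SDK (assuming the opentelemetry-sdk package is installed) to record the same workflow for both users while attaching per-request metadata as span attributes. The attribute names and service names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export finished spans to the console so the example is self-contained.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("bakery-app")

def search_inventory(user: str, flavor: str) -> None:
    # The workflow is identical for every user; only the metadata differs.
    with tracer.start_as_current_span("inventory-search") as span:
        span.set_attribute("user.name", user)      # hypothetical attribute names
        span.set_attribute("item.flavor", flavor)
        with tracer.start_as_current_span("inventory-service.lookup"):
            pass                                   # stand-in for the downstream call

search_inventory("Nichelle", "red velvet")
search_inventory("Robin", "red velvet")
```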
Seeing trace metadata is helpful for engineers investigating issues because it allows them to identify patterns, anomalies, or outliers and helps identify where issues lie in a stack.
You can learn more about how distributed tracing can be applied to your life, like tracking a vacation, by reading the blog “Explain it like I’m five: distributed tracing.”
Distributed tracing vs. logging
Distributed tracing and logging are both important elements of observability in distributed systems, but serve distinct purposes.
Distributed tracing:
- Provides a detailed view of a request’s journey across multiple services, to trace interactions and understand the system’s behavior in real time.
- Offers visibility into the flow and performance across microservices, highlighting the path requests take.
- Includes logs as part of the individual spans captured in a trace; for example, the OpenTelemetry specification for distributed traces includes SpanEvents, which can be used to record logs as part of a span (see the sketch after this list).
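As a hedged sketch of that idea, the snippet below uses the OpenTelemetry Python API to attach a log-like span event to a span (assuming a tracer provider has been configured, as in the earlier example); the event name and attribute key are illustrative.

```python
from opentelemetry import trace

tracer = trace.get_tracer("bakery-app")

with tracer.start_as_current_span("inventory-search") as span:
    try:
        raise LookupError("flavor not found")  # stand-in for a real failure
    except LookupError as exc:
        # A span event behaves like a log line that stays attached to this span,
        # so the message is visible in the context of the whole trace.
        span.add_event("inventory lookup failed", attributes={"error.message": str(exc)})
```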
Logging:
- Records specific events within applications, such as errors or transactions, providing snapshots of what happened at particular moments.
- Unlike distributed tracing, does not follow the flow of requests across services, focusing instead on discrete events within a single service.
The challenges of early distributed tracing tools
The era of distributed tracing kicked off with specialized tracing tools, but those tools saw limited adoption. Why? The early wave of tracing tools:
- Were hard to use
- Were adopted mainly by technically advanced users, who typically have a deep understanding of both the architecture and the tool
- Didn’t provide the level of detail needed to easily discover where a problem exists
- Didn’t provide good integration points with the existing workflows that engineering teams relied upon when troubleshooting issues
For example, an on-call engineer who got paged about an issue in a given service would not have a clear way to go from the service or behavior they were alerted on to the relevant traced requests that would help them understand the problem. This limitation led teams to ignore distributed tracing tools in favor of their existing workflows, or to repeatedly escalate issues to the advanced users capable of using the tools to get the necessary insights by hand.
Distributed tracing needs to be easy for novice and expert users alike
Let’s go back to our bakery example of Nichelle and Robin, who wanted to find out the inventory status of red velvet cupcakes. If there’s a problem searching inventory, engineers will likely see an error message and an admin will get an alert via their metric data that there is a problem. With distributed tracing instrumentation, engineers can quickly respond to alerts and analyze the end-to-end request flow. This allows them to identify which step is responsible for the failure, whether that is the component associated with the alert or an error that originated from another step farther down the request chain.
However, the process of going from alert to response needs to be as straightforward as possible for engineers; otherwise, expert users will continue to be a common escalation point for understanding system failures, and they will remain the only team members using distributed tracing tools.
Identifying and following steps that should be taken during incident-response workflows will ensure novices and experts alike are able to get value from distributed tracing.
How Chronosphere simplifies distributed tracing for any user
Finding where errors or latency have occurred in complex microservices environments is hard to do. It becomes even harder in the middle of the night when an inexperienced on-call engineer is trying to get services back online.
Chronosphere allows any engineer — not just power users — to seamlessly jump from an alert or a dashboard into related traces. Once there, engineers can quickly see where the source of the problem lies. By using a tool built with novice engineers in mind, any engineer can:
- Easily visualize how errors in a microservice are impacting services upstream in the request.
- Request a statistical analysis of all of the traces in a time window, and compare them to traces in a time window a few minutes prior to the incident. This allows engineers to look for differences.
Returning to our red velvet cupcake inventory example, an engineer would want to compare two end-user experiences:
- Nichelle’s request took one fifth of a second (200 ms) to return inventory data when she checked
- Robin tried 5 minutes later, and it took one tenth of a second (100 ms) to return the same information
In the end, Chronosphere combines metrics and traces to help engineers quickly find where a problem is, so it can be fixed and your business can get back on track.
Additional resources
Curious to know more about Chronosphere and distributed tracing? Check out the following resources: