What is distributed tracing?

on November 15th 2022

Cloud native has revolutionized application development in ways that are both positive and challenging. Adoption of microservices architectures on container-based infrastructures enables faster software development lifecycles. At the same time, problems can strike whenever apps change, for example when new features are added, and app updates can happen multiple times a day. So how do teams track down problems when error messages pop up, or when an application suddenly takes longer to load?

In the monolithic approach to application development, a straightforward application call makes it easy to find where a problem exists. Cloud native applications and the container-based infrastructure they run on are, by contrast, ephemeral, which makes problems elusive. Distributed tracing, which tells you where a problem is happening, becomes acutely important for teams that need to fix their applications quickly.

What does distributed tracing do?

Distributed tracing makes it possible to see where things are happening in a distributed system. It does this by capturing individual units of work, also known as spans. A good example of what distributed tracing follows is a workflow request, which is a series of activities that are necessary to complete a task. We see workflow requests in everyday activities, like ordering our favorite cupcakes online. The example below shows how this works:

Let’s say Nichelle and Robin each want to know if red velvet cupcakes are in stock at their local bakery. Nichelle and Robin would get on their respective mobile phones, open the bakery application, and search for “red velvet.”

  • When Nichelle and Robin initiate their searches for red velvet cupcakes, each triggers a workflow request to get information about inventory 
  • These workflow requests are then processed through application services
  • Information is returned to their respective mobile apps 

Keep in mind that the workflow requests for Nichelle and Robin were the same: each went through the bakery application, used the same services, and asked for the same type of cupcake. However, the metadata associated with each request, such as tags, performance, or descriptors, may be different. While workflow requests may be the same for multiple users, the associated metadata is unique.
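To make this concrete, here is a minimal Python sketch of the two searches. It is an illustration only: the `Span` class, the `search_inventory` function, and the span names are hypothetical and do not come from any particular tracing library.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical minimal model)."""
    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    tags: dict = field(default_factory=dict)


def search_inventory(user: str, flavor: str) -> list:
    """Simulate one workflow request and return the spans it produced."""
    trace_id = uuid.uuid4().hex  # every request gets its own trace ID
    root = Span(trace_id, "mobile-app.search", tags={"user": user, "query": flavor})
    api = Span(trace_id, "api.search", parent_id=root.span_id)
    lookup = Span(trace_id, "inventory.lookup", parent_id=api.span_id,
                  tags={"item": flavor})
    return [root, api, lookup]


if __name__ == "__main__":
    for user in ("nichelle", "robin"):
        for span in search_inventory(user, "red velvet"):
            print(span.trace_id[:8], span.name, span.tags)
```

Both requests produce the same sequence of span names, but each one carries its own trace ID and its own tags, which is the difference between the shared workflow and the unique metadata.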

Seeing trace metadata is helpful for engineers investigating issues because it allows them to identify patterns, anomalies, or outliers and helps identify where issues lie in a stack. 

You can learn more about how distributed tracing can be applied to everyday life, like tracking a vacation, by reading the blog Explain it like I’m five: distributed tracing.

How did distributed tracing come to be? 

In a monolithic world, workflow requests were easy to follow even though the application components were more complex, which made it easier to find where a problem was happening. In today’s cloud native world of microservices, things are reversed: application components are simpler, but the request workflows are more complex.

As this shift in complexity has continued, it has become harder to understand where problems are happening. Going back to our bakery metaphor: if Nichelle and Robin want to find out how many red velvet cupcakes are in stock at their local bakery, the workflow request looks different in a monolithic setup than in a microservices setup.

In a monolithic world, this would have been a simple workflow to one application, which would then make several calls within that service to collect the data. In a microservices environment, the same action of requesting inventory from our cupcake store causes the UI to fire off requests to multiple microservices simultaneously and to receive data back from each of them in parallel.

In either setup, a workflow request may show that a problem exists, but in a microservices environment it is far less obvious where the problem is.
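One way to see why trace data helps here is to model a fanned-out request as spans that point to their parent, then walk down from the root, following the slowest child at each level. The sketch below is a simplified illustration: the service names and durations are made up for the cupcake search, and real tracing tools record far more detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def children_of(spans, parent_id):
    """Return the spans whose parent is the given span ID."""
    return [s for s in spans if s.parent_id == parent_id]


def slowest_path(spans):
    """Walk from the root span down, following the slowest child each time."""
    root = next(s for s in spans if s.parent_id is None)
    path = [root]
    while True:
        kids = children_of(spans, path[-1].span_id)
        if not kids:
            return path
        path.append(max(kids, key=lambda s: s.duration_ms))


def example_trace():
    """Hypothetical trace: the UI fans out to three services at once."""
    return [
        Span("a", None, "ui", 180.0),
        Span("b", "a", "inventory", 40.0),
        Span("c", "a", "pricing", 35.0),
        Span("d", "a", "store-locator", 170.0),
        Span("e", "d", "geo-db", 160.0),
    ]


if __name__ == "__main__":
    print(" -> ".join(s.service for s in slowest_path(example_trace())))
```

In this made-up trace the walk ends at the `geo-db` span, which is where an engineer would look first. Without the parent links, all that is visible is that the UI request was slow.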

Early distributed tracing tools were hard to use 

The market responded to these architecture changes by building new specialized tracing tools, but these tools are rarely used. Why? The early wave of tracing tools:

  • Were hard to use
  • Were adopted mainly by technically advanced users, who typically have a deep understanding of the architecture and the tool
  • Didn’t provide the level of detail needed to easily discover where a problem exists

For example, sampling, which lets you decide what data to keep and show, is the right tool for some who need to store less detail, but not for others who need more. The bottom line is that deciding how much to sample should be left up to the user, not the distributed tracing tool itself.
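As a rough illustration of what user-controlled sampling can look like, the sketch below keeps or drops a whole trace based on a rate the user chooses. The `keep_trace` function and its `sample_rate` parameter are hypothetical names, and hashing the trace ID is just one common way to make the decision consistent across services. It is not a description of how any specific tool samples.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, given a user-chosen rate in [0, 1].

    The decision depends only on the trace ID, so every service that sees
    the same trace reaches the same answer.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    threshold = int(sample_rate * 2**64)        # keep buckets below this value
    return bucket < threshold


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10_000)]
    for rate in (0.01, 0.25, 1.0):
        kept = sum(keep_trace(t, rate) for t in ids)
        print(f"rate={rate}: kept {kept} of {len(ids)} traces")
```

A team that needs every detail can set the rate to 1.0, while a team that is happy with a representative slice can lower it, without the tool making that choice for them.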

Distributed tracing needs to be easy for novice and expert users alike 

Let’s go back to our bakery example of Nichelle and Robin, who wanted to find out the inventory status of red velvet cupcakes. If there’s a problem searching inventory, engineers will likely get an error message and an admin will get an alert from their metric data that there is a problem. If we approached this same request workflow in a monolithic manner, by instrumenting (instructing) each unit of work to send telemetry data back to our observability tool, it would consume valuable time and resources and add cost. And if the transaction has already occurred by the time we identify that there is a problem, it becomes much harder to narrow down where the problem occurred.

Add in the growing complexity of collaboration as teams within organizations get bigger, and the entire process starts to feel more like a black box than something intuitive. Assigning the right expert to a problem from the start is crucial for your business’s success and for giving customers a seamless experience.

How does distributed tracing benefit my organization?

Distributed tracing takes a two-pronged approach to benefiting your organization. 

  • First, metrics act as an early warning system that lets teams know there is a problem. This also makes it easier for novice team members to quickly understand what’s going on, rather than having to call in a power user when something goes wrong.
  • Second, distributed tracing provides insights into services that help development teams understand things like poor system health and identify where bottlenecks are in the software stack. This all happens in conjunction with the early warning capability described above. Teams are provided with the data they need to restore services, deliver a positive end-user experience, and adhere to the organization’s service-level agreements.

Going back to our earlier example of Nichelle and Robin wanting to know how many red velvet cupcakes are in stock: with distributed tracing, teams are able to find out where in this request workflow a problem may be.

Why do I need distributed tracing?

Tracking, that’s why. Without tracking we can’t tell where or when something happened in workflows. 

But how does this apply to development teams? Here are some of the key reasons why implementing distributed tracing in your organization can aid you and, in turn, your customers. Distributed tracing:

  • Informs development teams about the health and status of deployed application systems and microservices
  • Identifies irregular behavior that results from scaling automation
  • Reviews how the average response times, error frequencies, and other metrics are reflected through the end-user’s experience
  • Tracks and records vital statistics on performance with user-friendly dashboards
  • Debugs and isolates bottlenecks within the system while addressing performance issues at the code level
  • Recognizes and addresses the root cause of unexpected issues

How Chronosphere simplifies distributed tracing for any user

Finding where errors or latency have occurred in complex microservices environments is hard to do. It becomes even harder in the middle of the night when an inexperienced on-call engineer is trying to get services back online.

Chronosphere allows any engineer — not just power users — to seamlessly jump from an alert or a dashboard into related traces. Once there, engineers can quickly see where the source of the problem lies. By using a tool built with novice engineers in mind, any engineer can:

  • Easily visualize how errors in a microservice are impacting services upstream in the request. 
  • Request a statistical analysis of all of the traces in a time window, and compare them to traces in a time window a few minutes prior to the incident. This allows engineers to look for differences. 

Returning to our red velvet cupcake inventory example, I would want to compare two end user experiences:

  • Nichelle’s request took 1/5th of one second to get inventory data when she checked
  • Robin tried 5 minutes later and it took 1/10th of one second to get the same information 
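To show the kind of comparison involved, here is a small Python sketch that averages span durations per service in two sets of traces and ranks the services by how much they differ. It is not Chronosphere’s implementation. The per-service breakdown is invented for illustration, and only the 1/5-second and 1/10-second totals come from the example above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (service, seconds) spans; only the totals match the example.
NICHELLE_SPANS = [("api", 0.02), ("inventory", 0.15), ("pricing", 0.03)]  # 0.20 s
ROBIN_SPANS = [("api", 0.02), ("inventory", 0.05), ("pricing", 0.03)]     # 0.10 s


def mean_by_service(spans):
    """Average span duration in seconds per service for one set of traces."""
    grouped = defaultdict(list)
    for service, seconds in spans:
        grouped[service].append(seconds)
    return {service: mean(values) for service, values in grouped.items()}


def compare_windows(baseline, slow):
    """Rank services by how much slower they are in `slow` than in `baseline`."""
    base, current = mean_by_service(baseline), mean_by_service(slow)
    shared = base.keys() & current.keys()
    deltas = {service: current[service] - base[service] for service in shared}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for service, delta in compare_windows(ROBIN_SPANS, NICHELLE_SPANS):
        print(f"{service}: {delta:+.2f} s")
```

With these invented numbers, the whole 1/10-second difference comes from the `inventory` service, so that is where an engineer would start looking.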

In the end, Chronosphere combines metrics and traces to help engineers quickly find where a problem is, so it can be fixed and your business can get back on track.

So, why not see how Chronosphere is approaching the distributed tracing problem?
