By Moshe Sambol (Lightrun) and Gilles Ramone (Chronosphere)
Organizations are moving to microservices- and container-based architectures because these modern environments enable speed, efficiency, availability, and the power to innovate and scale more quickly. However, when it comes to troubleshooting distributed cloud native applications, teams face a unique set of challenges due to the dynamic and decentralized nature of these systems. To name a few:
- Lack of visibility: With components spread across various cloud services and environments, gaining comprehensive visibility into the entire system can be difficult. Access to production environments is generally strictly limited to ensure the safety of customer-facing systems. This makes it challenging to understand run-time anomalies and identify the root cause of issues.
- Complexity: Distributed systems are inherently complex, with numerous microservices, APIs, and dependencies. Understanding how these components interact and affect one another can be daunting when troubleshooting.
- Challenges with container orchestration: When using serverless systems and container orchestration platforms like Kubernetes, processes can be ephemeral, making it very challenging to identify the resources related to specific users or user segments, and to capture and analyze the state of the system relevant to specific traffic.
- Cost of monitoring and logging: Setting up effective monitoring and logging across all components is crucial, but aggregating logs and metrics from various sources is costly, and correlating them is complex.
Addressing these challenges requires a combination of a robust observability platform and tooling that simplifies complexity and helps developers understand the behavior of their deployed applications.
These tools must address organizational concerns for security and data privacy. The best observability strategy will enable the ongoing “shift left” — giving developers access to and responsibility for the quality, durability, and resilience of their code, in every environment in which it runs. Doing so will enable a more proactive approach to software maintenance and excellence throughout the software development lifecycle.
Efficient troubleshooting requires not just gathering data, but making sense of that data: Identifying the highest priority signals from the vast quantity and variety produced by large deployments. Chronosphere turbo-charges issue triage by collecting and then prioritizing observability data, providing a centralized observability analysis and optimization solution.
Rather than aggregating and storing all data for months, at ever increasing cost to store and access, Chronosphere pre-processes the data and optimizes it, substantially reducing cost and improving performance.
When leveraging Chronosphere together with Lightrun, engineers are rapidly guided to the most incident-relevant observability data that helps them identify the impacted service. From there, they can connect directly from their local integrated development environment (IDE) via the Lightrun plugin to debug the live application deployment. With Chronosphere’s focus and Lightrun’s live view of the running application, developers can quickly understand system behavior, complete their investigation at minimal cost, and close the cycle of the troubleshooting process.
Chronosphere + Lightrun: A technical walkthrough
Ready to see Chronosphere’s ability to separate signal from noise in a metrics-heavy application, and Lightrun’s on-demand, developer-initiated observability capabilities in action?
To demonstrate, we’re going to use a small web application that’s deployed to the cloud and under load. The application provides simple functionality: users are presented with a list of pictures, and they can bookmark the ones they like most.
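The core bookmarking behavior might look something like the following minimal sketch. The names and data structures here are illustrative assumptions, not the actual demo code:

```python
# Hypothetical sketch of the demo app's core logic: each user keeps a set of
# bookmarked picture IDs.
bookmarks: dict[str, set[str]] = {}

def like(user_id: str, picture_id: str) -> None:
    """Bookmark a picture for a user."""
    bookmarks.setdefault(user_id, set()).add(picture_id)

def unlike(user_id: str, picture_id: str) -> None:
    """Remove a bookmark if present."""
    bookmarks.get(user_id, set()).discard(picture_id)
```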
In this example, we’ve been alerted by Chronosphere about something amiss in our application’s behavior: It seems that some users are experiencing particularly high latency on some operations. Chronosphere pinpoints this to the “un-like” operation.
But why only some users?
The app designers are doing some A/B testing to see how users react to various configurations that may improve the site’s usability and performance. They use feature flags to randomly select subsets of users to get slightly different experiences. The percent of the audience exposed to each feature flag is determined by a config file.
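A common way to implement this kind of config-driven rollout is to hash the user ID into a stable bucket and compare it against the configured percentage, so each user always lands in the same experiment group. A sketch of that technique follows; the flag names and percentages are made up for illustration:

```python
import hashlib

# Hypothetical config: fraction of the audience enrolled in each experiment.
FLAG_ROLLOUT = {"new_grid_layout": 0.50, "lazy_unlike": 0.05}

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user into a flag's rollout percentage.

    Hashing flag+user maps each user to a stable value in [0, 1], so the
    same user always gets the same experience for a given flag.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return bucket < FLAG_ROLLOUT.get(flag, 0.0)
```

Because assignment is derived from the ID rather than stored, a rolling restart doesn’t reshuffle who sees what.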
Unfortunately, in our rush to roll out the feature flag controlled experiments, we neglected to include logging, so we have no information about which users are included in each of the experiment groups.
The feature flags — possibly individually, possibly in combination — may be causing the latency that Chronosphere has identified. In order to know for sure, we’ll need to add some logging, which means creating a new build and rolling out an update. Kubernetes would let us do this without bringing down the entire application, but that might just further confuse which users are getting which feature flags, so it seems some downtime may be our best option.
Well, that would be the situation without Lightrun.
Since we’ve deployed Lightrun’s agent, we can introduce observability on demand: no code changes, no new build, and no restarts required. That means we can add new logging to the running system without even having access to the containers.
We can safely gather the application state, just as we’d see it if we had connected a debugger, without opening any ports and without pausing the running application!
Lightrun provides remote, distributed observability directly in the interface where developers feel most at home: their existing IDE. With Lightrun’s IDE plugins, adding observability on the fly is simply a matter of right-clicking in your code, choosing the pods of interest, and hitting submit.
Back to the issue at hand, we’ll use a dynamic log to get a quick feel for who is using the system. Lightrun quickly shows us that we’ve got a bunch of users actively bookmarking pictures. By using Lightrun tags, we’re able to gather information from across a distributed deployment without needing details of the running instances.
That’s nice, but it’s still hard to tell what’s going on with a specific user who’s now complaining about the latency. We use conditional logging to reduce the noise and zoom in on that specific user’s activity. From there we can see that their requests are being received, but we still need to answer the question: What’s going on?
What we really want is a full picture, including:
- This user’s feature flags
- The list of items they’ve bookmarked
- Anything else from the environment that could be relevant
Enter Lightrun snapshots — virtual breakpoints that show us the state without causing any interruption in service.
Creating a snapshot is just as easy as adding a log — we choose the tags that represent our deployment, add any conditions so that we’ll just get the user we’re interested in — regardless of which pod is serving that user at the moment. And there we have it, all of the session state affecting that user’s interaction with the application.
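Conceptually, a conditional snapshot copies the local state of the matching request without stopping it. The following sketch approximates that behavior in plain Python; `capture_if` and the surrounding names are invented for illustration, not Lightrun’s API:

```python
import inspect

snapshots: list[dict] = []

def capture_if(user_id: str, target_user: str) -> None:
    """Conditionally record a copy of the caller's local variables,
    approximating a virtual breakpoint: state is captured, execution
    is never paused."""
    if user_id != target_user:
        return
    caller = inspect.currentframe().f_back
    snapshots.append(dict(caller.f_locals))
```

A handler that calls `capture_if(user_id, "user-4711")` would record its feature flags, bookmark list, and any other locals at that moment — regardless of which pod happened to serve the request.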
With this information we can see that one of our feature flags is to blame — it looks like it’s only partially implemented. It’s a good thing that only a small percentage of our audience is getting this one! Oops.
Before we roll out a fix, let’s get an idea of how many users are being affected by each of our feature flags. We can use Lightrun’s on-demand metrics to add counters to measure how often each block within our code is being reached. And we can add tick-tocks to measure the latency impact of this code, just in case our experimentation is also slowing down the site’s responsiveness.
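Manually, the counter and tick-tock pattern boils down to counting block entries and accumulating elapsed time between two points. A minimal sketch, with labels chosen purely for illustration:

```python
import time
from collections import Counter

hits = Counter()                # how often each labeled block is reached
elapsed: dict[str, float] = {}  # accumulated seconds spent in each block

def tick(label: str) -> float:
    """Count a block entry and start its timer."""
    hits[label] += 1
    return time.perf_counter()

def tock(label: str, start: float) -> None:
    """Stop the timer and accumulate the block's latency."""
    elapsed[label] = elapsed.get(label, 0.0) + (time.perf_counter() - start)

# Usage: wrap the code path under suspicion.
start = tick("unlike_handler")
time.sleep(0.01)  # stand-in for the feature-flagged block being measured
tock("unlike_handler", start)
```

Lightrun lets us attach the equivalent of these measurements to a live process, so the data flows out without a redeploy.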
Watch the full technical walkthrough below:
The Chronosphere and Lightrun combined solution
It’s imperative to have all observability data going into a cloud native observability platform like Chronosphere, which helps alert us to the needle in the haystack of all the telemetry our distributed applications are producing. And with Lightrun, developers are able to query the state of the live system right in their IDE, where they can dynamically generate additional telemetry to send to Chronosphere for end-to-end analysis.
By using these solutions together, we leverage the unique capabilities provided by each. The result is full cloud native observability: Understanding what’s going on in our code, right now, wherever it is deployed, at cloud native scale. Zooming in on the details that matter despite the complexity of the code and the deployment. Combining new, on-demand logs and metrics with those which are always produced by our code — for control, cost management, and automatic outlier alerting.
With developer-native, full-cycle observability, these powerful tools are supporting rapid issue triage, analysis, and resolution. This is essential to organizations realizing maximum observability benefits while maintaining control over their cloud native costs.