4 key observability best practices to know


With bigger systems, higher loads, and more interconnectivity between microservices in cloud native environments, everything has become more complex. Cloud native environments emit somewhere between 10 and 100 times more observability data than traditional, VM-based environments.


As a result, engineers can't make the most of their workdays: they spend more time on investigations, cobbling together a story of what happened from siloed telemetry, and less time innovating.

Without the right observability setup, precious engineering time is wasted sifting through data to spot where a problem lies rather than shipping new features, which can lead to buggier releases and a worse customer experience.

So how can modern organizations find relevant insights in a sea of telemetry and make their telemetry data work for them, not the other way around? Let’s explore why observability is key to understanding your cloud native systems and four observability best practices for your team.

What are the benefits of observability?

Before we dive into ways your organization can improve observability, lower costs, and ensure a smoother customer experience, let's talk about what the benefits of investing in observability actually are.

Better customer experience

With better understanding and visibility into relevant data, your organization’s support teams can gain customer-specific insights to understand the impact of issues on particular customer segments. Maybe a recent upgrade works for all of your customers except for those under the largest load or during a certain time window. Using this information, on-call engineers can resolve incidents quickly and provide more detailed incident reports.

Better engineering experience and retention

By investing in observability, site reliability engineers (SREs) benefit from knowing the health of teams or system components, so they can better prioritize their reliability efforts and initiatives.

As for developers, benefits of observability include more effective collaboration across team boundaries, faster onboarding to new services/inherited services, and better napkin math for upcoming changes.

Four observability best practices

Now that we have a better understanding of why teams need observability to run their cloud native system effectively, let’s dive into four observability best practices teams can use to set themselves up for success.

1. Integrate with developer experience

Observability is everyone’s job, and the best people to instrument it are the ones who are writing the code. Maintaining instrumentation and monitors should not be a job just for the SREs or leads on your team.

A thorough understanding of the telemetry life cycle (the life of a span, metric, or log) is key: from initial configuration, to emitting signals, to any modification or processing the data undergoes before it is stored. If there is a high-level architecture diagram, engineers can better understand whether and where their instrumentation gets modified (for example, aggregated or dropped). Often, this processing falls in the SRE domain and is invisible to developers, who won't understand why their new telemetry is partially or entirely missing.

You can check out simple instrumentation examples in this OpenTelemetry Python Cookbook.
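For illustration, a minimal manual-instrumentation sketch with the OpenTelemetry Python SDK might look like the following. The service name, span name, attributes, and collector endpoint are placeholder assumptions, not prescriptions.

```python
# Minimal manual-instrumentation sketch using the OpenTelemetry Python SDK.
# Service name, span name, and the collector endpoint are placeholders.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the tracer provider once at application startup.
provider = TracerProvider(
    resource=Resource.create({"service.name": "photo-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("photo_service.downloads")

def download_photo(photo_id: str) -> bool:
    # One span per call, with business-relevant attributes attached.
    with tracer.start_as_current_span("download_photo") as span:
        span.set_attribute("photo.id", photo_id)
        # ... fetch the photo here ...
        span.set_attribute("download.cache_hit", True)
        return True
```

Running this assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed and a collector is listening on the endpoint shown.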

If there are enough resources and a clear need for a central internal tool, platform engineering teams should consider writing thin wrappers around instrumentation libraries to ensure standard metadata is available out of the box.
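As a rough sketch of what such a wrapper could look like, assuming a hypothetical internal module with an `init_telemetry` helper and a `team` attribute your organization standardizes on (none of these names come from a real library):

```python
# Sketch of a thin internal wrapper that bakes standard metadata into every
# tracer an application requests. The function names, the "team" attribute,
# and the environment variable are illustrative assumptions.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


def init_telemetry(service_name: str, team: str) -> None:
    """Configure tracing once, with org-standard resource attributes attached."""
    resource = Resource.create({
        "service.name": service_name,
        "team": team,  # used later to attribute telemetry volume and cost
        "deployment.environment": os.getenv("DEPLOY_ENV", "dev"),
    })
    provider = TracerProvider(resource=resource)
    # Swap ConsoleSpanExporter for your real exporter (OTLP, vendor, etc.).
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)


def get_tracer(instrumentation_name: str) -> trace.Tracer:
    """Hand application code a tracer without exposing SDK setup details."""
    return trace.get_tracer(instrumentation_name)
```

Application code then calls `init_telemetry("checkout", team="payments")` once at startup and imports only `get_tracer`, so standard metadata comes along for free instead of every team re-learning SDK setup.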

Viewing changes to instrumentation

Another way to enable developers is to provide a quick feedback loop when instrumenting locally, so they can view instrumentation changes before merging a pull request. This is helpful for training and for teammates who are new to instrumenting or unsure how to approach it.
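One lightweight way to get that feedback loop, sketched here under the assumption of a hypothetical DEV_TELEMETRY toggle, is to print spans straight to stdout while developing locally:

```python
# Sketch: during local development, print spans to stdout so instrumentation
# changes can be reviewed before the pull request is merged. The DEV_TELEMETRY
# toggle and span/attribute names are assumptions for illustration.
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
if os.getenv("DEV_TELEMETRY") == "1":
    # SimpleSpanProcessor exports each span as soon as it ends -- slow, but
    # ideal for immediate feedback while iterating locally.
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("local.preview")
with tracer.start_as_current_span("new_instrumentation_under_review") as span:
    span.set_attribute("feature.flag", "photo-dedupe")  # placeholder attribute
```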

Updating the on-call process

Updating the on-call onboarding process to pair a new engineer with a tenured one for production investigations can help distribute tribal knowledge and orient the newbie to your observability stack. It’s not just the new engineers who benefit. Seeing the system through new eyes can challenge seasoned engineers’ mental models and assumptions. Exploring production observability data together is a richly rewarding practice you might want to keep after the onboarding period.

You can check out more in this talk from SRECon, “Cognitive Apprenticeship in Practice with Alert Triage Hour of Power.”

2. Monitor observability platform usage in more than one way

Becoming comfortable with tracking your current telemetry footprint and reviewing options for tuning it, like dropping, aggregating, or filtering data, helps your organization monitor costs and platform adoption proactively. The ability to track telemetry volume by type (metrics, logs, traces, or events) and by team makes it easier to define and delegate cost-efficiency initiatives.
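As one hedged example of a tuning lever, head-based sampling can shrink trace volume at the source; the 10% ratio below is arbitrary, and whether sampling is appropriate at all depends on your workloads.

```python
# Sketch: one tuning lever for telemetry footprint is head-based sampling.
# TraceIdRatioBased keeps a deterministic fraction of traces; the 10% ratio
# here is an arbitrary illustration, not a recommendation.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(provider)
```

Combined with a standard `team` resource attribute (as in the wrapper sketch earlier), volume and cost can then be broken down per team.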

Once you’ve gotten a handle on how much telemetry you’re emitting and what it’s costing you, consider tracking daily and monthly active users of the platform. This can help you pinpoint which engineers need training on it.

These observability best practices for training and cost lead to a better understanding of the value each vendor provides you, as well as what’s underutilized.

3. Center business context in observability data

Surfacing the business context in a pile of observability data can take the pressure off high-stakes moments in a couple of ways:

  • By making it easier to translate incidents into the workflows and functionality they affect from a user’s perspective
  • By creating a more efficient onboarding process for engineers

One way to center business context in observability data is by renaming default dashboards, charts, and monitors. For example, rather than a dashboard full of default-named metrics from a Redis cache that could refer to anything, engineers should name alerts, dashboards, and charts to indicate their business or customer use, such as “recently downloaded photos.”
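A small sketch of what that looks like in code, with purely illustrative metric and attribute names:

```python
# Sketch: name telemetry after the customer-facing behavior it measures rather
# than the backing infrastructure. Metric and attribute names are illustrative.
from opentelemetry import metrics

meter = metrics.get_meter("photo_service.downloads")

# "recently_downloaded_photos.cache_hits" tells an on-call engineer what broke
# for the customer; a default "redis.keyspace_hits" chart does not.
cache_hits = meter.create_counter(
    "recently_downloaded_photos.cache_hits",
    description="Cache hits while serving the recently downloaded photos view",
)

def serve_recent_photos(customer_tier: str) -> None:
    cache_hits.add(1, {"customer.tier": customer_tier})

serve_recent_photos("enterprise")
```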

4. Un-silo your telemetry

Teams need better investigations. One way to ensure smoother remediation is an organized workflow that feels like following breadcrumbs, rather than juggling 10 different bookmarked links and a mental map of what data lives where.

One way to do this is by understanding what telemetry your system emits across metrics, logs, and traces and pinpointing potential duplication or better sources of data. To achieve this, teams can create a trace-derived metric that represents an end-to-end customer workflow (a sketch of one approach follows the list), such as:

  • “Transfer money from this account to that account.”
  • “Apply for this loan.”
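One way to derive such a metric, sketched here with illustrative span and metric names, is a custom span processor that records a histogram whenever the root span of the workflow ends:

```python
# Sketch: derive a workflow-level metric from spans so that "transfer money"
# health shows up as one signal instead of scattered per-service telemetry.
# The span name, metric name, and processor class are illustrative assumptions.
from opentelemetry import metrics
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

meter = metrics.get_meter("payments.workflows")
workflow_latency = meter.create_histogram(
    "workflow.transfer_money.duration",
    unit="s",
    description="End-to-end latency of the transfer-money customer workflow",
)


class WorkflowMetricProcessor(SpanProcessor):
    """Record a latency metric whenever the root span of a tracked workflow ends."""

    def on_end(self, span: ReadableSpan) -> None:
        # Only the root span (no parent) represents the whole customer workflow.
        if span.name == "transfer_money" and span.parent is None:
            duration_s = (span.end_time - span.start_time) / 1e9  # ns -> s
            workflow_latency.record(
                duration_s, {"status": span.status.status_code.name}
            )

# Registered alongside the exporter, e.g.:
# tracer_provider.add_span_processor(WorkflowMetricProcessor())
```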

Regardless of whether you’re sending telemetry to multiple vendors or to a mix of a DIY in-house stack and vendors, make sure you can link data between systems, such as by adding the trace ID to log lines or adding a dashboard note with links to pre-formatted queries. That extra support helps your team perform better investigations and remediate issues faster.
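A minimal sketch of the trace-ID-in-logs idea, using Python’s standard logging with an illustrative log format:

```python
# Sketch: stamp the active trace ID onto every log line so a log record can be
# linked straight to its trace in another tool. The log format is illustrative.
import logging

from opentelemetry import trace


class TraceIdFilter(logging.Filter):
    """Attach the current OpenTelemetry trace ID (if any) to log records."""

    def filter(self, record: logging.LogRecord) -> bool:
        span_context = trace.get_current_span().get_span_context()
        record.trace_id = (
            format(span_context.trace_id, "032x") if span_context.is_valid else "-"
        )
        return True


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
)

logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("transfer initiated")  # emits: ... trace_id=<hex or -> transfer initiated
```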

Explore Chronosphere’s future-proof solution

Engineering time comes at a premium. The more you invest in high-fidelity insights and in helping engineers understand what telemetry is available, the more fearless instrumenting becomes, the faster troubleshooting gets, and the better positioned your team is to make future-proof, data-informed decisions when weighing options.

As companies transition to cloud native, uncontrolled costs and rampant data growth can stop your team from performing successfully and innovating. That’s why cloud native demands reliable, future-proof observability that can keep up.

Take back control of your observability today, and learn how Chronosphere’s solutions manage scale and meet modern business needs.

