How to implement monitoring standards at scale

on June 16th 2022

As with every system you build, scaling the metrics platform is a non-trivial task. Like every other system, scaling happens across multiple axes simultaneously, including a user adoption axis and a technical axis. In this blog post, we’ll dive into some of the non-technical challenges observability teams face as they scale the use of their metrics platform across an organization. By the way, if you’re more interested in building an observability team, we also have you covered.

Scaling user adoption across the organization

First up is scaling user adoption. Because monitoring is a cross-cutting concern, many different teams across the organization have to deal with monitoring their application, services, and components.

That means that many teams have to do the same work, introducing the risk of inconsistency, re-invention of the wheel, and duplication of work.

Scaling metrics systems, from this organizational perspective, is prone to risks similar to other cross-cutting concerns like security, cloud, platform engineering, and CI/CD pipelines. Dealing with these challenges is vital to the success of user adoption of such platforms, even though the ways of dealing with them are fairly generic and apply equally to each of these concerns.

Solve a cross-cutting problem 

Obviously, it all starts with the observability team solving a problem that engineering teams actually have. This common denominator problem should be non-trivial and outside of the engineering team’s direct responsibility. Building and maintaining the observability platform itself is a great example – otherwise each engineering team would have to build their own. For the observability team, building a metrics platform is a core business, but for the engineering teams using the metrics system, it’s not. For them it’s non-differentiating busywork and toil.

By moving that work from each individual engineering team to a centralized function, the observability team can deliver a higher quality experience to everyone by delivering a single common, shared observability platform. The economies of scale allow the central team to better safeguard cost-effectiveness, security, operational efficiency, and more.

A single shared platform helps create visibility across the entire organization that teams wouldn’t have if they all ran their own siloed metrics systems; an approach that’s particularly problematic when solving issues that involve multiple teams and services.

The common and shared platform also allows the observability team to deliver best practices and guardrails for successfully using the platform, standardizing the way teams deal with metrics data, queries, and dashboards. This includes creating documentation, (pre-)instrumenting common services, and helping teams stay within their metrics budget. Standardization has the added benefit of creating a consistent experience, which is much needed as the usage of metrics data increases – both across the organization and within each team.

Reduce friction for teams to get started

Standardization not only helps with consistent implementation across the landscape, it also helps with onboarding teams and speeds up the instrumentation phase. It frees up teams to spend time on more valuable work by reducing the resources and effort spent on the common layers of the infrastructure.

The observability team, in addition to delivering a shared metrics platform as a service, is responsible for defining the standards and practices – and making sure they are successfully adopted by engineering teams. Adoption rates increase when the observability team creates scaffolding that defines which metrics use which tags or labels. This means setting standards for those tags and labels across the service landscape, creating standards for instrumentation, and pre-instrumenting common components or services. It also involves creating documentation and how-to guides, facilitating training and onboarding, and gathering feedback continuously to improve and align more closely with the engineering teams’ needs.
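As a rough illustration of what a tag and label standard could look like when expressed as a check, here is a minimal Python sketch. The required label names, the allowed environment values, and the function name `check_metric` are all hypothetical and chosen only for this example; a real standard would come from the observability team and the conventions of the metrics platform in use.

```python
import re

# Hypothetical organization-wide standard; these label names and values
# are illustrative assumptions, not taken from any particular platform.
REQUIRED_LABELS = {"service", "team", "environment"}
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}
METRIC_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def check_metric(name: str, labels: dict[str, str]) -> list[str]:
    """Return human-readable violations of the (assumed) labeling standard."""
    problems = []
    if not METRIC_NAME_PATTERN.match(name):
        problems.append(f"metric name {name!r} is not snake_case")
    # Report missing labels in a stable, sorted order.
    for label in sorted(REQUIRED_LABELS - set(labels)):
        problems.append(f"missing required label {label!r}")
    env = labels.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment {env!r}")
    return problems
```

A check like this could run in a CI pipeline or a code review bot, so that teams get feedback on the standard before a metric reaches production rather than after.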

The observability team is uniquely positioned to create internal frameworks that standardize how to instrument common services, like container platforms, database services, load balancers, and more (which includes things like default dashboards and sane alerts). This helps limit the friction of getting started and using the system by removing the variation and duplication of many teams each doing the same boilerplate work a little differently. This frees up teams to spend time on more valuable, business-facing work.

This reduction of friction in using and being successful with the metrics platform makes it easier for the next team (and the one after that) to actually start using the platform, reducing the variance of different teams using different solutions, and increasing the adoption of a standardized, consistent way of working across the organization.

A large part of being successful is the long-term adoption of the metrics platform. In many cases, after the honeymoon period is over, teams need help to further improve and stay successful in using the platform. That requires the expertise of observability engineers who know how to deal with the ever-increasing entropy and complexity of managing large data sets as engineering teams increase their usage of the metrics platform.

The observability team’s superpowers help teams tame metrics data growth and complexity when the engineering teams inevitably add dimensions and increase cardinality, resolution, or retention as time passes.
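To make the effect of adding a dimension concrete, the sketch below computes an upper bound on the number of time series for a single metric as the product of the distinct values of each label. The label names and counts are made-up numbers for illustration only; real series counts are usually lower than this bound because not every label combination occurs.

```python
from math import prod


def series_upper_bound(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(distinct_values_per_label.values())


# Hypothetical label cardinalities for one HTTP request metric.
baseline = {"service": 20, "endpoint": 15, "status_code": 5}
# The same metric after a team adds a per-pod label.
with_pod = {**baseline, "pod": 50}

print(series_upper_bound(baseline))  # 1500
print(series_upper_bound(with_pod))  # 75000
```

A single extra label with 50 values multiplies the bound by 50, which is why seemingly small instrumentation changes can have a large effect on cost.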

Give flexibility where needed

However, no team is really alike. Engineering teams, while they can benefit immensely from consistency and standardization of the boilerplate metrics best practices, need flexibility to adjust the best practices to what they need, instead of vice versa.

To know where to enjoy the experience of standardization, and where to deviate from the beaten path, teams must have an understanding of their place in the landscape, and know what part their services play in the organization’s value streams. These help inform teams what parts of their service they need to instrument at what level, and where they may need to deviate from the standards. While there are various activities teams can engage in to create a better understanding of their place in the application services landscape (and web of dependencies), diving into those methodologies is a bit much for this blog post. If you want to know more about this, read this white paper that delves into understanding and taming high cardinality (look for the “Understanding the landscape” section).

The framework of best practices created by the observability team is just that: scaffolding that provides structural support. But the framework shouldn’t be so prescriptive that teams cannot deviate; the goal is to create a framework that helps teams get to 80% quickly and consistently by using languages that have pre-built support for automatic instrumentation, service scaffolding templates, or sidecar service mesh architectures. The goal of the 80% rule is to provide a basic, consistent implementation across common technologies and shared infrastructure, so that teams don’t have to spend time and effort on non-differentiating work, and instead can focus on the 20%.
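One way to picture the split between a standard baseline and team-specific deviations is a shared default configuration that teams override only where they need to. The Python sketch below is an assumption-laden illustration: the class `InstrumentationConfig`, its fields, their default values, and the helper `config_for_team` are all hypothetical and do not correspond to any specific tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InstrumentationConfig:
    # Hypothetical platform defaults (the "80%"); values are illustrative.
    scrape_interval_seconds: int = 30
    retention_days: int = 30
    default_dashboard: bool = True
    extra_labels: tuple[str, ...] = ()


PLATFORM_DEFAULT = InstrumentationConfig()


def config_for_team(**overrides) -> InstrumentationConfig:
    """Start from the platform default and apply only the team's deviations.

    Unknown field names raise TypeError, which acts as a light guardrail.
    """
    return replace(PLATFORM_DEFAULT, **overrides)


# A team that needs finer resolution and one domain-specific label (the "20%").
checkout = config_for_team(
    scrape_interval_seconds=10,
    extra_labels=("payment_provider",),
)
```

Everything the team does not override stays on the shared default, so deviations remain explicit and easy to review.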

The 20% is likely highly specific to each individual team, which is where teams need the freedom and flexibility to deviate from a standard. The observability team acts as a trusted advisor to help teams contain the runaway growth that high-cardinality data sets bring, and helps optimize the insights driven by this 20%.

This approach balances the concept of “giving teams a fish” for the boilerplate instrumentation with “teaching teams to fish” so they learn to take care of the highly specific instrumentation and metrics themselves. Teaching all teams the same way of working, with the same best practices, has the added benefit of consistency, making it easy for teams to explore and observe adjacent services that they don’t own. This contributes to building a uniform and shared understanding across the services landscape, and helps keep the entire system maintainable and future-proof.

Taming data growth and cardinality

In this blog post, we looked at some of the non-technical challenges observability teams face as they scale the use of their metrics platform across an organization. However, we barely touched on technical aspects like taming cardinality or managing data growth. If you want to dive right into how to scale the metrics platform, all you have to do is follow this link to Balancing cardinality and cost to optimize the observability platform to learn how to keep it alive and well as adoption increases across the organization.
