What does building a world-class observability team look like?
Over the past few years, Prometheus has become the de facto open-source metrics monitoring solution. Its ease of use (a single binary with an out-of-the-box setup), simple text exposition format, label-based query language (PromQL), and large ecosystem of exporters (custom integrations) have led to its quick and widespread adoption.
Many companies have adopted Prometheus to solve their metrics monitoring use cases. However, those with large amounts of data and highly demanding use cases quickly push Prometheus to (and often beyond) its limits. To remedy this, the Prometheus project introduced remote storage, which integrates with external systems through remote write and remote read HTTP endpoints. Remote storage lets companies store and query Prometheus metrics at larger scale and with longer retention.
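For teams evaluating this route, wiring up remote storage is a small change to the Prometheus configuration. The fragment below is only a sketch of the general shape, assuming a hypothetical backend at metrics.example.com; the exact endpoint paths depend on the remote storage system you choose.

```yaml
# prometheus.yml fragment: remote storage wiring (the backend URL is a placeholder)
remote_write:
  - url: "https://metrics.example.com/api/v1/push"   # hypothetical remote write endpoint
remote_read:
  - url: "https://metrics.example.com/api/v1/read"   # hypothetical remote read endpoint
    read_recent: true   # also serve recent data from the remote backend
```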
I broke down the process for building an observability team into three sections that cover responsibilities, people, and approach to team structures (distributed vs. centralized).
Observability team responsibilities
The central observability team supports the engineers and developers involved in delivering your service to end users. There are four major responsibilities the team should be thinking about:
- Defining monitoring standards and practices. This responsibility is not just about defining what you're doing, but also the why and the how, so it includes documentation, guides, and best practices (see the example rule after this list).
- Delivering monitoring data to engineering teams. Take what you've defined, make sure it's available to everyone, and do so in a way that empowers teams to perform monitoring functions themselves.
- Measuring the reliability and stability of monitoring solutions. It is critical to establish and maintain trust that these systems will be available when needed.
- Managing tooling and storage of metrics data. Make it simple: if it takes a ninja, people won't use it.
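To make "standards" concrete, here is a minimal sketch of the kind of alerting-rule template a central team might publish, written as a Prometheus rule file. The metric name, threshold, labels, and runbook URL are illustrative assumptions, not part of the original guidance.

```yaml
# standards-alert-template.yaml: an illustrative sketch only.
# The metric, threshold, team label, and runbook URL are placeholder assumptions.
groups:
  - name: service-standards
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page        # the standard requires an explicit severity
          team: payments        # the standard requires an owning-team label
        annotations:
          summary: "Error rate above 1% for 10 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```

A template like this matters less for its specific threshold than for its required fields: every alert names an owner, a severity, and a runbook.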
Who’s in charge of your observability?
Organizations need to answer several questions about who will run the observability practice, where that team reports, and what kind of team structure it has:
- Who? Can an internal person take this on, or do we need to go outside the organization? The team may include a mix of people with internal knowledge and context and others with experience from established Site Reliability Engineering (SRE) practices.
- Where? Should they report into SRE, operations, platform, or engineering? The answer is: it depends! If there is an SRE practice, that's the logical place. Otherwise, look for a centralized function that can act as a governance overlay.
- When? At what point does this need to become a full-time role rather than someone's part-time job? It's time for a full-time role if you are missing Service Level Indicators (SLIs) and Service Level Objectives (SLOs), if customers are finding out about problems before you do, or if you are ramping up scale in a cloud-native environment.
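If you're unsure whether you have SLIs in place, a quick litmus test is whether anything like the following exists in your rule files. This is only a minimal sketch of an availability SLI expressed as a Prometheus recording rule; the http_requests_total counter and the 30-day window are assumptions for illustration.

```yaml
# sli-recording-rule.yaml: a minimal availability SLI sketch.
# The metric name and window are illustrative assumptions.
groups:
  - name: sli-rules
    rules:
      - record: service:availability:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
            / sum(rate(http_requests_total[30d]))
```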
What does a distributed SRE or observability team look like?
In a distributed SRE or observability function, SREs are embedded in different teams around the organization. This is a great way to get started with observability, but the approach brings challenges over the longer term:
- Teams begin to diverge in how they perform monitoring and observability.
- It’s challenging to get a global view across all of the teams, operations, and systems.
- Duplicate effort occurs, which means wasted effort.
What does a centralized observability function look like?
A centralized observability function looks at challenges everyone is facing from one vantage point in order to:
- Define monitoring standards and practices.
- Provide monitoring data to engineering teams.
- Manage tooling and storage of metrics data.
- Ensure reliability and stability of monitoring solutions.
How can a small observability team be successful?
Small observability teams have the toughest challenge due to their size, but they can succeed by following a few useful tips.
- Think through why your organization emits telemetry data and how you consume it.
  - More data is not better: sheer volume can make your life more difficult (see the relabeling sketch after this list).
  - This kind of audit reveals what observability looks like in your company and helps determine where resources should be spent.
- Teach DevOps teams how to fish.
  - Delegate so that teams consuming observability tools can get their work done without depending on the central team's progress.
- Don't build if you don't have to!
  - Commercial off-the-shelf (COTS) tools can help.
  - Look for the most user-friendly vendor or open source tooling.
  - Maximize future-proofing: this limits the pain of moving between different data formats.
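One place where an audit of "what are we emitting and why" turns into immediate action is dropping series nobody queries. The fragment below is a hedged sketch that drops metrics matching illustrative name patterns at scrape time using Prometheus relabeling; the job, target, and patterns are assumptions, not recommendations.

```yaml
# prometheus.yml scrape fragment: dropping unused series at ingestion time.
# The job name, target, and metric-name patterns are illustrative assumptions.
scrape_configs:
  - job_name: "app"
    static_configs:
      - targets: ["app.example.com:9090"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "debug_.*|verbose_internal_.*"
        action: drop
```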
What KPIs and metrics should you keep in mind?
After identifying responsibility, people, and team structure, it’s time to think about some of the internal KPIs and metrics needed to build a strong muscle for observability at your organization.
These KPIs and metrics can be broken down into three categories:
- How many FTEs should be on the observability team? One of the most basic questions to answer is how many full-time employees (FTEs) should be dedicated to this work.
- When we first set up Graphite monitoring at Uber, there were 2 dedicated FTEs out of 100 engineers. We later grew to 5 FTEs out of 500, and eventually to 50 out of 2,500. Looking back, my takeaway is that it's optimal to dedicate about 2% of your engineering headcount to observability, whether that headcount is shared or dedicated.
- How much time and infrastructure cost should we invest in observability?
- Observability investment has economies of scale. Because observability is a core function, if you're not spending a meaningful amount on it (at least a few percentage points of your cloud bill), you'll likely run into significant reliability challenges.
- How do you measure the success of your observability investment? Answering this question takes several factors into account:
- Do people at the company feel that SLOs are in place and being met? You'll want best practices to be regularly implemented by end-user teams and customers.
- Conduct NPS (net promoter score) surveys at a regular cadence in order to get feedback from folks using your systems.
- Measure and be aware of every incident and how much a lack of monitoring played a role. Your observability function correlates directly with how quickly you detect a problem, how well you recover from it, and how much downtime your systems experience. Ideally, teams should also be able to automatically roll back a high percentage of bad changes thanks to observability investments. As an example from personal experience at Uber, we estimated that more than 60% of incidents could be traced back to a single change made to the system. Investment in advanced monitoring was one of our highest points of leverage, because we could automatically roll back the vast majority of production-impacting changes before they caused any real-world impact. You can try this yourself by automatically rolling back any change to production when one of your alerts fires during a rollout or soon afterwards.
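One lightweight way to experiment with alert-driven rollbacks is to route alerts that fire during a rollout to an automated receiver. The fragment below is only a sketch of that idea in Alertmanager terms: the rollout label convention and the rollback webhook endpoint are hypothetical, and the actual rollback logic would live in whatever deployment system receives the webhook.

```yaml
# alertmanager.yml fragment: route rollout-scoped alerts to an automated hook.
# The "rollout" label convention and the webhook URL are hypothetical.
route:
  receiver: on-call
  routes:
    - matchers:
        - rollout="in-progress"    # assumes rollout-scoped alerts carry this label
      receiver: auto-rollback
receivers:
  - name: on-call
    # ... normal paging configuration ...
  - name: auto-rollback
    webhook_configs:
      - url: "https://deploy.example.com/hooks/rollback"   # hypothetical rollback endpoint
```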
Where do we go from here — Observability 2.0 and beyond
Once you have everything in place, you’ll eventually want to upgrade your observability functions to a 2.0 or even 3.0 level. We’ll visit this topic in more detail in a future post, touching on areas such as automation, auto-remediation, “monitoring the monitor”, and controlling the data pipeline in areas such as aggregation.
Whether you’re just learning about what observability is, or already on your way to building an observability team, we’ve covered key considerations in any observability journey. Your next stop is researching purpose-built tools for helping your team monitor your cloud-native environment.