Chronosphere co-founder and CTO Rob Skillington talks about how to build and run a team that will ensure your organization has great observability.
On: Jun 10, 2021
Over the past few years, Prometheus has become the de facto open-source metrics monitoring solution. Its ease of use (a single binary that works out of the box), text exposition format and tag-based query language (PromQL), and large ecosystem of exporters (custom integrations) have led to its quick and widespread adoption.
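To make those two features concrete, here is what the text exposition format looks like (the metric name and labels below are illustrative, not from any particular exporter):

```
# HELP http_requests_total Total number of HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
```

A PromQL query over that counter, such as the per-second request rate over the last five minutes, would be `rate(http_requests_total{code="200"}[5m])`.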
Many companies have implemented Prometheus to help solve their metrics monitoring use cases. However, those with large amounts of data and highly demanding use cases quickly push Prometheus to (and often beyond) its limits. To remedy this, Prometheus introduced a concept called remote storage, which lets Prometheus send samples to, and fetch them from, external systems via remote write and remote read HTTP endpoints. Remote storage allows companies to store and query Prometheus metrics at larger scales and over longer retention periods.
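As a minimal sketch, wiring Prometheus to a remote storage backend is a few lines in `prometheus.yml`; the endpoint URLs below are placeholders for whatever paths your backend actually exposes:

```yaml
# prometheus.yml — forward scraped samples to a remote backend,
# and allow queries to fan out to it for long-term data.
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
remote_read:
  - url: "https://remote-storage.example.com/api/v1/read"
```

With this in place, Prometheus streams every scraped sample to the write endpoint, so retention and horizontal scale are handled outside the local TSDB.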
I broke down the process for building an observability team into three sections that cover responsibilities, people, and approach to team structures (distributed vs. centralized).
The central observability team supports the engineers and developers involved with delivering your service to end users. There are four major responsibilities they should be thinking about:
Organizations need to answer several questions about who will run the observability practice, where that team reports, and what kind of team structure it should have, including:
In a distributed SRE or observability function, SREs are embedded in different teams around the organization. This is a great way to get started with observability, but the approach has challenges over the longer term:
A centralized observability function looks at challenges everyone is facing from one vantage point in order to:
Small observability teams have the toughest challenge due to their size, but they can succeed by following a few useful tips.
After identifying responsibility, people, and team structure, it’s time to think about some of the internal KPIs and metrics needed to build a strong muscle for observability at your organization.
These KPIs and metrics can be broken down into three categories:
Once you have everything in place, you’ll eventually want to upgrade your observability function to a 2.0 or even 3.0 level. We’ll revisit this topic in more detail in a future post, touching on areas such as automation, auto-remediation, “monitoring the monitor”, and controlling the data pipeline through techniques such as aggregation.
Whether you’re just learning what observability is or already on your way to building an observability team, we’ve covered the key considerations in any observability journey. Your next stop is researching purpose-built tools to help your team monitor your cloud-native environment.
Request a demo for an in-depth walkthrough of the platform!