5 ways to keep your observability healthy in 2023

Jess Lulka
on January 16th 2023

This time of year, it’s common to see ads and messaging promoting self-improvement and making the new year your best yet. It’s easy enough to figure out personal goals, but what about aspirations for observability and cloud native technology health?

Here are five ways organizations can keep their cloud native environments running smoothly and in shape.

1. Wrangle observability data growth 

Keeping on top of observability data growth is an essential part of running cloud native environments. A side effect of adopting cloud native is that it creates a lot of unusable observability data, which is essentially junk. Many individuals throughout the organization create this junk data without much thought about its value or system impact. Left unchecked, observability data grows astronomically without adding value.

To keep data volumes from becoming overwhelming, there are several tactics you can adopt. Aggregation rules help cloud native organizations reduce metric cardinality and volume. Aggregation happens after data is collected but before it is stored, so teams store only useful data. The result is improved query performance and reduced costs.
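
As a rough illustration of how aggregation reduces cardinality, the sketch below sums samples after dropping a high-cardinality label. The metric name, the labels, and the `aggregate` helper are all hypothetical and are not Chronosphere's aggregation rule syntax.

```python
from collections import defaultdict

# Hypothetical raw samples: (metric name, labels, value).
# The "pod" label is the high-cardinality dimension in this example.
RAW_SAMPLES = [
    ("http_requests", {"service": "checkout", "pod": "checkout-1"}, 120.0),
    ("http_requests", {"service": "checkout", "pod": "checkout-2"}, 80.0),
    ("http_requests", {"service": "search", "pod": "search-1"}, 50.0),
]


def aggregate(samples, drop_labels):
    """Sum samples after removing the given labels, so fewer series are stored."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        # Keep only the labels we still care about, in a stable order.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[(name, kept)] += value
    return dict(totals)


aggregated = aggregate(RAW_SAMPLES, drop_labels={"pod"})
print(len(RAW_SAMPLES), "series in,", len(aggregated), "series stored")
```

Here three per-pod series collapse into two per-service series. The per-pod detail is lost, which is the trade-off to weigh before dropping a label.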

Quotas are another way for users to control how fast data comes in and avoid potential performance issues. With rate limits and data stream visibility, it’s much easier for teams to prevent data explosions or catch them early. Chronosphere’s Profiler lets teams see data generated in real time and get alerts if a specific service, application, team, or collection is nearing or exceeding volume limits.
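
The idea of a quota with a "nearing the limit" warning can be sketched in a few lines. The `Quota` class, the 80% warning threshold, and the per-team numbers below are assumptions made for illustration; they do not describe how Chronosphere implements quotas.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical limit on data points written per minute."""
    limit_per_minute: int
    warn_ratio: float = 0.8  # assumed threshold for a "nearing limit" warning


def quota_status(points_last_minute: int, quota: Quota) -> str:
    """Classify an ingest rate as "ok", "warning", or "over_limit"."""
    if points_last_minute > quota.limit_per_minute:
        return "over_limit"
    if points_last_minute >= quota.warn_ratio * quota.limit_per_minute:
        return "warning"
    return "ok"


# Hypothetical per-team ingest rates for the last minute.
usage = {"checkout": 4_000, "search": 8_500, "payments": 12_000}
quota = Quota(limit_per_minute=10_000)
statuses = {team: quota_status(points, quota) for team, points in usage.items()}
print(statuses)
```

A check like this is only useful if it runs continuously against live ingest rates, so the warning arrives while there is still time to act.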

2. Monitor to control costs and avoid overages  

An increasingly common complaint about observability tools is that their cost can start to outweigh their value, and that the bill can quickly add up if users aren’t aware of every possible charge. As more organizations adopt cloud native and observability, they should take the time to evaluate the potential costs and acknowledge that there are factors beyond the dollar amount.

Observability pricing has fewer guidelines than cloud deployments, but that doesn’t mean it’s impossible to figure out potential costs. The first step is to figure out what you can measure in terms of spend by service, team, and function. Next, determine any usage costs that come with data usage, such as read/write, storage, and access. With this information, you can get a better idea of what to include in usage reports and what to track for spending.

To help evaluate return on investment (ROI) for observability, teams should figure out: 

  • What portion of spend each service or team accounts for. 
  • The top sources of overall spend, such as most expensive metrics. 
  • How the volume of collected data fluctuates. 
  • Rate of data growth vs. infrastructure growth.
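
Two of the questions above reduce to simple arithmetic. The sketch below attributes a bill to teams in proportion to their data volume and compares data growth with infrastructure growth. All figures, team names, and function names are made up for illustration, and proportional attribution is just one possible allocation model.

```python
# Hypothetical monthly usage: data points ingested per team.
usage_by_team = {
    "checkout": 600_000_000,
    "search": 300_000_000,
    "payments": 100_000_000,
}
monthly_bill = 20_000.00  # assumed total observability spend, in dollars


def spend_by_team(usage, total_cost):
    """Split the total cost across teams in proportion to their data volume."""
    total_points = sum(usage.values())
    return {team: total_cost * points / total_points for team, points in usage.items()}


def growth_rate(previous, current):
    """Fractional growth between two periods (0.25 means 25% growth)."""
    return (current - previous) / previous


spend = spend_by_team(usage_by_team, monthly_bill)
top_spender = max(spend, key=spend.get)

# Assumed month-over-month figures: data points ingested vs. hosts running.
data_growth = growth_rate(800_000_000, 1_000_000_000)
infra_growth = growth_rate(400, 440)
data_outpacing_infra = data_growth > infra_growth
```

When data grows much faster than the infrastructure that produces it, as in these sample numbers, that is usually a prompt to look at what new metrics or labels were added.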

Beyond ROI evaluation, observability leaders must keep close track of which teams are consuming the most capacity to help curb potential data use and storage costs. Chronosphere’s platform features let users see where data is going, which teams and services collect the most data, and what type of data is ingested.

These usage metrics help organizations see what data is driving their observability costs, and let teams see how much their data is costing them, before it affects the bottom line.

3. Establish objectives to keep observability reliable 

For companies that run cloud native environments, it is essential to meet service-level commitments. Whether you are working with a service-level indicator (SLI), a service-level objective (SLO), or a service-level agreement (SLA), your team should know what each one entails and what processes are in place to measure or meet it.

Figure out what your organization’s different levels of service mean, what tools are needed to reach them, and what should happen if you do fall short of your contracts. As teams form service agreements, it helps to:

  • Confirm that the agreement is realistic for the team to meet. 
  • Get all relevant stakeholders involved. 
  • Use historical data to see if the service can meet the agreement. 
  • Stay flexible as architecture and agreements change. 

To ensure that your services are reliable, you must have an observability solution that is at least as reliable as your promised service level. For example, Chronosphere has its own internal SLOs to make sure that customers can get at least 99.9% reliability, although our track record greatly exceeds this. The platform makes sure all data meets requirements for availability, performance, and correctness.
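
To make a figure like 99.9% concrete, it helps to translate it into an error budget, meaning the amount of downtime the target allows. The sketch below assumes a 30-day window and a hypothetical 12 minutes of downtime already used; the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Error budget left after the downtime already incurred (negative if overspent)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.9)             # about 43.2 minutes over 30 days
remaining = budget_remaining(99.9, 12.0)        # assumed 12 minutes of downtime so far
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is why the observability platform itself cannot be less reliable than the services it watches.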

With these internal standards, the Chronosphere platform provides features that are trustworthy, efficient, and accurate for users. Aligning each software capability with these objectives means that users know their observability platform can provide the best, up-to-date information on how reliable their cloud native environment is.  

4. Set actionable and contextual alerts to ensure uptime 

Downtime has both financial and reputational consequences, which means organizations want to avoid it at all costs. Observability gives teams a holistic view of their cloud native environment and lets teams get all necessary alerts. But organizations must have the right features so engineers can find any essential information as quickly as possible when it’s relevant.

To keep a cloud native environment online, engineers want to know the urgency and importance of every alert that comes across their dashboards along with the necessary context to remediate the issue. An observability solution should not only include this info, but also have the right workflows and search functions set up so that any potential uptime issues reach the right technical staff. 
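
One way to picture an alert that carries urgency, context, and a destination is the sketch below. The `Alert` fields, severity levels, ownership map, and routing rules are all hypothetical; real alerting tools express this through their own configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    severity: str  # "critical", "warning", or "info" (assumed levels)
    service: str
    context: dict = field(default_factory=dict)  # e.g. runbook or dashboard links


# Hypothetical map from service to the team that owns it.
OWNERS = {"checkout": "payments-oncall", "search": "search-oncall"}


def route(alert: Alert) -> str:
    """Pick a destination based on severity and the owning team."""
    owner = OWNERS.get(alert.service, "platform-oncall")  # assumed fallback team
    if alert.severity == "critical":
        return f"page:{owner}"
    if alert.severity == "warning":
        return f"ticket:{owner}"
    return "dashboard"


alert = Alert(
    name="HighErrorRate",
    severity="critical",
    service="checkout",
    context={"runbook": "hypothetical-runbook-link"},
)
destination = route(alert)
```

The point of the structure is that severity decides how loudly the alert is delivered, ownership decides who receives it, and the context travels with it.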

Chronosphere’s platform collects all potential alerts through its Monitors functionality and adds an extra layer of information with Alert History. This helps users who work with open source alerting software gain extra context for any alerts and find the information they need as quickly as possible.

Users can search through alerts by activity to find each alert’s source, current status, and type so that they can effectively triage any issues and resolve them as quickly as possible, before they impact customers.

5. Focus on observability outcomes, not inputs to hit MTTR goals 

With observability, getting the alert is just the first step in the journey, and it’s often a much longer process to return everything to an operable state. According to Chronosphere’s 2023 Cloud Native Observability Report: Overcoming Cloud Native Complexity, only 1% of companies are meeting their mean time to repair (MTTR) targets.

Part of this is because engineers don’t necessarily have the right information to effectively triage an issue (see point 4). But organizations can redefine how they think about MTTR for today’s cloud technologies by focusing more on business outcomes and customer impact than on the technology alone.

Instead of simply looking at the data types of observability (metrics, traces, and logs), organizations should focus on the outcomes that observability provides. Three phases (know, triage, and understand) provide a better process for engineers to resolve any infrastructure issues.

With the three phases, engineers can focus on what real-time data will help relieve customer issues, as opposed to trying to perform a full root-cause analysis while simultaneously dealing with a customer-impacting service failure.

Collecting observability telemetry isn’t enough to hit MTTR goals; engineers must be able to effectively use it. With a focus on outcomes instead of data types, organizations can use their observability tooling to get the right insights for faster incident resolution.  
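
For teams that want to check themselves against an MTTR goal, the calculation is simply the average time from detection to resolution. The incident timestamps and the one-hour target below are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected, resolved).
incidents = [
    (datetime(2023, 1, 3, 9, 0), datetime(2023, 1, 3, 9, 45)),
    (datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 16, 0)),
    (datetime(2023, 1, 12, 22, 30), datetime(2023, 1, 12, 23, 0)),
]


def mean_time_to_repair(records) -> timedelta:
    """Average duration between detection and resolution."""
    durations = [resolved - detected for detected, resolved in records]
    return sum(durations, timedelta()) / len(durations)


mttr = mean_time_to_repair(incidents)
target = timedelta(hours=1)  # assumed MTTR target
meets_target = mttr <= target
```

In this sample, one two-hour incident pushes the average to 65 minutes and past the target, which shows how much a single slow triage can weigh on the overall number.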

To see what features you can use to keep your observability healthy, contact us for a demo.
