The ability to adapt and innovate faster than the competition is critical, and it's an area every business must continually improve. Today's competitive advantages include a nimble engineering organization with the talent and foresight to stay ahead, and a robust observability strategy that ensures high-performing teams catch issues introduced to their apps and fix them fast. But what if a change to an application or to your monitoring configuration causes a huge increase in the amount of observability data you're collecting?
When it comes to high-performing engineering, history shows the most successful teams make significantly more changes in a given week than their peers, and those changes are deployed much more quickly. A few years ago, a Puppet State of DevOps Report calculated that high IT performers deploy to production 46 times more frequently than low performers, with lead time for changes 440x faster. Fast-forward a few years, and the 2021 Google State of DevOps report found that elite-performing organizations deploy changes 973x more frequently than low-performing organizations.
With so many deployments and changes, there are many more chances to introduce problems that cause organizations to experience a data explosion. How well they manage those explosions can mean the difference between a business staying relevant and becoming obsolete. From my engineer's perspective in the cloud native observability space, I see four things you can do to get ahead of an observability data explosion: understand the problem itself, recognize the shortcomings of the tools out there, make a plan, and know what to look for once you've rolled up your sleeves and started investigating technology that will help (hint: quotas and tools to shape traffic are your best friends).
Four Ways to Prevent Observability Explosions
1. Recognize the Danger of Observability Data Explosions
Data explosions can be dangerous to a business's operational agility. But where is all that extra data coming from? The latest culprit is rampant cloud native adoption: Companies are transitioning their architecture from monoliths to containers and microservices, and that shift is increasing the severity of observability data growth. Data growth was already a challenge; with observability in cloud native environments, it's akin to pouring gasoline on a fire. The explosion just got bigger. Our CTO Rob Skillington talks about finding the right balance of observability data (aka the Goldilocks zone) in his blog, What is high cardinality? Meanwhile, my colleague Ales talks about how to control data growth in his blog, How to wrangle metric data explosions.

On top of that risk, the trend toward consumption-based pricing for vendor solutions means that when explosions do occur, we feel the full consequences acutely. After all, the same models that prize flexibility are built for the best-case scenario, in which all growth is predictable and happens intentionally.
Most businesses have faced, or can imagine facing, this scenario: An individual on the team deploys a change to a service that adds a new dimension to a metric. They let it run for a while to gather some data, but when they view the results in the observability platform, uh oh, big surprise. Instead of the expected modest increase in data, the mistake has increased the metric's cardinality far more than anticipated.
And the problem isn’t limited to metrics. It’s just as easy to introduce a span or log line that gets emitted far more frequently than expected.
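To see why a single new dimension is so dangerous, consider a rough back-of-the-envelope calculation. This Python sketch (the metric's label names and their cardinalities are hypothetical) shows that a metric's series count is roughly the product of its label cardinalities, so one high-cardinality label multiplies the total:

```python
from math import prod

# Hypothetical label cardinalities for a request-latency metric.
labels_before = {"service": 20, "endpoint": 50, "status_code": 5}

# One "innocuous" change: also label by customer_id (10,000 distinct values).
labels_after = {**labels_before, "customer_id": 10_000}

# Worst-case series count is the product of the label cardinalities.
series_before = prod(labels_before.values())  # 20 * 50 * 5 = 5,000 series
series_after = prod(labels_after.values())    # 5,000 * 10,000 = 50,000,000 series

print(series_before, series_after, series_after // series_before)
```

Ten thousand times more series from one line of instrumentation: that is the shape of the surprise described above.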
2. Understand the Limitations of Current Tools and Approaches
Here’s what the business can expect to happen next if it operates its own tooling, or relies on most vendors’ cloud-connected solutions, and experiences the scenario just described:
- Service or app instability or downtime — If organizations operate their own observability tooling, even innocuous mistakes, like the one in the example above, can cause severe application instability or downtime. This impacts the business’s ability to understand the state of its production infrastructure and hampers productivity, since teams have to restore their observability tools in addition to fixing the data explosion.
- Costly and unexpected overage charges — If organizations rely on a vendor solution that’s connected to the cloud, common consumption-based pricing will likely result in an atypically high overage charge at the end of the month.
Teams relying on open-source tools don’t have many good options to protect their organizations from the consequences of a telemetry explosion. At best, they can monitor their typical volume and alert when there is a sudden uptick. But little to no open source tooling exists that actually protects their systems from a sudden, massive increase in incoming data.
Teams trusting vendor solutions connected to the cloud are in a similar situation. They can typically configure alerts to let people know if there’s a change in how much data they are sending, but unless teams act quickly, the result is almost always a painful overage bill at the end of the month.
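For teams in either camp, volume alerting is still worth setting up even though it can’t prevent the spike on its own. As one illustrative sketch (the rule name, threshold, and window are assumptions, not a recommendation), a Prometheus alerting rule can compare the current ingest rate against the same window one day earlier:

```yaml
groups:
  - name: telemetry-volume
    rules:
      - alert: MetricIngestSpike
        # Fires when the current sample ingest rate is more than 3x the
        # rate observed one day ago (threshold is illustrative).
        expr: |
          sum(rate(prometheus_tsdb_head_samples_appended_total[5m]))
            > 3 * sum(rate(prometheus_tsdb_head_samples_appended_total[5m] offset 1d))
        for: 10m
        labels:
          severity: warning
```

An alert like this shortens the window between an explosion starting and someone noticing, but as the next two paragraphs explain, noticing is not the same as being protected.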
Finally, catching data explosions through extremely thorough code reviews is problematic. This approach is proactive, but it relies on existing mental models of the systems in operation, and therein lies the flaw: If we already understood our systems well enough, we wouldn’t need to change their instrumentation in the first place. Moreover, trying to compensate for potential problems with ever more exhaustive reviews can ultimately slow development progress, a steep price to pay for a fundamentally flawed form of protection.
3. Don’t Believe Hope Is a Strategy
Observability data explosions are inevitable, and in today’s hyper-competitive marketplace, it’s insufficient to simply be extra careful and hope engineering teams avoid them. Although in-house engineers can attempt to build protections into their organization’s existing open-source or vendor-managed observability products, these complex efforts can quickly become expensive and take time away from core engineering initiatives more valuable to the business.
As teams now strive to make changes to systems and their accompanying instrumentation and monitoring more often — which increases the likelihood that observability data explosions will occur — ask yourself: What protections can your organization put in place to prevent or minimize their impact?
4. Investigate a Cloud Native Platform to Tamp Down Data Explosions
A production-grade, cloud native observability platform such as Chronosphere will handle observability data explosions gracefully. It will include built-in protections so that when there is a massive increase in data, the platform not only remains stable and accessible but also keeps costs predictable, based on limits the organization has deemed appropriate.
In order to support this requirement, businesses considering a production-grade observability platform should look for specific capabilities:
- The ability to set rate limits on how much data gets stored, so data explosions don’t threaten the stability of the organization’s products or result in unexpected costs.
- The ability to set quotas for individual services or teams, so that the consequences of data explosions are contained to where they are introduced.
- Functionality to help teams more easily diagnose where unexpected spikes in data are coming from, along with tools to re-shape incoming data before storage costs are incurred, without changing instrumentation.
- More advanced capabilities, such as the ability to prioritize what specific data is kept or dropped when rate limits are enforced.
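Tying those capabilities together, a platform with quotas, rate limits and traffic shaping might expose configuration along these lines. This is a purely hypothetical sketch to illustrate the concepts; it is not Chronosphere’s (or any vendor’s) actual configuration syntax:

```yaml
# Hypothetical configuration: illustrates rate limits, per-team quotas,
# and priority-based shaping; not any real product's syntax.
ingest:
  global_rate_limit: 2000000        # max datapoints/sec stored org-wide
quotas:
  per_team:
    checkout: 500000                # contain a checkout-team explosion...
    search: 300000                  # ...without affecting other teams
shaping:
  when_over_quota:
    - keep: '{__name__=~"slo_.*"}'  # SLO-critical metrics survive first
    - drop: '{customer_id!=""}'     # shed high-cardinality series first
```

The key idea is that limits are enforced before storage costs are incurred, and explicit priorities decide what gets shed when a limit bites.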
Learn More about Observability Data Explosion Solutions
Observability data explosions with unpredictable costs don’t have to cause organizational headaches. Chronosphere, a highly scalable, cloud native observability platform, has built-in, sophisticated controls to help manage your observability data in a predictable and efficient manner.