This year’s Predict ‘22 conference was packed with savvy forecasts about the year ahead, including one from Martin Mao, our co-founder and CEO. During his forecasting session, Martin spoke on a topic near and dear to Chronosphere’s heart: observability data growth, which he predicts will reach a tipping point in the next year.
Here are key takeaways from Martin’s talk, including four approaches to taming observability data growth.
Observability data growth outpacing the business
Data growth – particularly observability data growth – is heading toward a tipping point. Looking at how the sheer volume of observability data has grown over the years, as companies transitioned their architecture from on-premises, to a VM-based cloud architecture, to a cloud-native architecture, observability data growth is far outpacing the rate at which the general business – or the infrastructure footprint – is growing.
Benefits of going cloud native: fast and reliable
To explain how data will reach a tipping point, Martin illustrates an example scenario about an organization that a few years ago was running applications on VMs. Flash forward to today: “You’re running a Kubernetes cluster on those same VMs. You’re trying to go cloud native and you’ve broken up that application into microservices instead, and you’re running those on the containers on the same cluster.”
At this point, he explains, your infrastructure footprint and your bill for that cluster don’t really change much:
- The cloud providers don’t charge you that much to run Kubernetes on top of the underlying VMs.
- You’re really paying for the underlying VMs, so your infrastructure footprint hasn’t changed very much.
- With the same cluster size, you’re pretty much serving the same business use cases.
Your business probably hasn’t changed much at this point. At the same time, this new cloud-native architecture brings advantages, such as:
- You can deploy multiple times a day
- Your architecture is a lot more reliable
Cloud-native challenge: data growth
Enter the data tipping point: One of the unfortunate side effects of a cloud-native architecture is that it produces a lot more data, Martin explains.
“There are tens of these containers running on top of every VM. And then on top of that, they’re very ephemeral. They’re changing all the time, and every time they change, it’s a brand new container. You can just imagine the sheer amount of data that produces, and it’s on the same infrastructure footprint.”
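Martin’s multiplier can be sketched with back-of-envelope arithmetic. All counts below are assumptions for illustration, not figures from the talk:

```python
# Illustrative estimate: how container density and churn multiply the
# number of unique time series on the same VM footprint. Every restart
# yields a new container ID, and each ID carries its own metric series.

def unique_series(hosts, containers_per_host, restarts_per_day, metrics_per_unit):
    """Unique series emitted per day across the fleet."""
    return hosts * containers_per_host * restarts_per_day * metrics_per_unit

# VM era: one long-lived workload per host.
vm_era = unique_series(hosts=100, containers_per_host=1,
                       restarts_per_day=1, metrics_per_unit=100)

# Cloud-native era: tens of ephemeral containers per host, same 100 VMs.
cloud_native = unique_series(hosts=100, containers_per_host=20,
                             restarts_per_day=12, metrics_per_unit=100)

print(cloud_native // vm_era)  # → 240 (240x more series, same footprint)
```

Even with conservative assumptions, the same hardware footprint can produce hundreds of times more unique series once containers are both dense and ephemeral.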
Moving up the stack
The same explosive growth happens with your single monolithic application, Martin explains, which has been broken up into tiny microservices. Each microservice emits roughly the same order of magnitude of data as the original application, and now you have many more of them: “So you can see very quickly how adopting cloud-native architecture results in so much more data being produced,” says Martin. The downside, he explains, is that “as that data volume starts to grow, generally the cost of the observability system rises in a correlated manner” because:
- You’re either running these observability systems yourselves –
- Result: More data means more hardware to these systems, which increases the cost.
- Or you use a managed vendor for your observability solution –
- Result: Vendor solutions charge based on volume of data. “You can imagine underneath the covers, these vendors also have to store all of that data – that’s their cost model,” explains Martin.
The high cost of observability data
Costs rise in step with observability data, which means that as the overall bill grows, observability’s share of the infrastructure bill grows with it. “When we talk to companies, we hear that within the infrastructure bill, observability is around number two or three and it’s growing very quickly,” Martin said. The problem for businesses ultimately centers on a lack of value, he said, noting:
- If observability data were growing and companies were getting better visibility into everything they’re producing – faster time to detection, faster time to repair – companies would be satisfied, because the cost would come with value.
- Instead, we’re seeing that data is growing at a fast pace and cost is growing with it, yet people aren’t getting better value. When we ask companies about their mean time to detection (MTTD) or mean time to repair (MTTR), those numbers are often getting worse. Companies are paying a lot more and getting less return on that investment.
“We believe that cost disparity will reach a tipping point that is going to force companies to do something about it,” said Martin.
Four techniques to tame data growth
According to Martin, the overall approach to solving the data growth problem is to understand the outcomes. This means inspecting the data, understanding what you are using the data for, and optimizing the data for those use cases.
Martin lays out four techniques to help with data growth: retention, resolution, efficient storage, and aggregation.
1) Retention
One way to efficiently manage data is through retention. To illustrate, Martin describes a typical environment in which there is one set retention period of 13 months for all of your data, whether you run your systems in-house or use a managed provider. This means every piece of data that gets produced, no matter how you want to use it, is retained for 13 months.
“That may be useful for some pieces of data, but in the modern cloud-native architecture, where we are deploying multiple times a day, and a container is only around for a couple of hours, a huge amount of that modern observability data does not need to be retained for 13 months.”
Taking the retention example further, Martin explains:
- If you are tracking the CPU of a particular pod, and you have that down to the unique pod ID level, it is a waste to store that data for 13 months. Rather, storing that data for a day is enough.
- In contrast, if you do want to store some data over time for capacity planning or trend analysis, it makes sense to store that across a cluster or particular set of services.
- Changing the retention period of that subset of data from a year down to a single day yields a 365x reduction in the amount of storage you need.
“Retention is not just setting the retention across all of your data,” said Martin. “It’s understanding each use case – for each subset of the data that serves those use cases – and understanding the optimal retention period.”
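The savings from per-use-case retention reduce to simple arithmetic. The policy shares and retention periods below are hypothetical examples, not numbers from the talk:

```python
# Sketch: storage needed under a blanket 13-month retention versus a
# tiered policy that matches retention to each use case.

THIRTEEN_MONTHS_DAYS = 13 * 30  # ~390 days, the blanket default

# (share of total data volume, retention in days) — hypothetical tiers
policies = [
    (0.80, 1),    # pod-level metrics: a day is enough
    (0.15, 30),   # service-level dashboards
    (0.05, 390),  # cluster-level trends for capacity planning
]

# Relative storage cost ~ volume share x days retained.
blanket = sum(share for share, _ in policies) * THIRTEEN_MONTHS_DAYS
tiered = sum(share * days for share, days in policies)

print(round(blanket / tiered, 1))  # → 15.7 (overall reduction factor)
```

The pod-level tier alone sees the 365x reduction Martin describes; blended across all tiers, the fleet-wide win in this sketch is still more than an order of magnitude.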
2) Resolution
The second data-optimizing technique – resolution – refers to how frequently data is emitted. “Am I tracking the CPU every 10 seconds versus every minute versus every hour?” Martin breaks down how to think about resolution in two examples:
- In a continuous integration/continuous delivery (CI/CD) use case – you do automated deploys, so tracking every second or 10 seconds makes a lot of sense. “You need to know in real-time what is happening with your CPU so that you can roll back a deploy, or roll back a particular change.”
- In contrast, other use cases – such as capacity planning or long-term trend analysis – don’t require that data down to the per-second basis.
Martin notes: As with retention, people often make the mistake of using defaults, which range from 10 seconds to a minute. However, he says, it’s critical to “understand your sub-use cases, and optimize the data for resolution for that use case.
“If we are measuring every 10 seconds versus measuring every minute, there is a 6x difference in the amount of data that needs to be produced and stored.”
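The 6x figure falls straight out of the interval arithmetic; a minimal sketch:

```python
# Datapoints stored per series per day at a given scrape interval.

SECONDS_PER_DAY = 86_400

def points_per_day(interval_seconds):
    """Samples one series produces per day at this resolution."""
    return SECONDS_PER_DAY // interval_seconds

high_res = points_per_day(10)  # 10-second resolution: 8,640 points/day
low_res = points_per_day(60)   # 1-minute resolution: 1,440 points/day

print(high_res // low_res)  # → 6 (the 6x difference Martin cites)
```

The same function makes it easy to evaluate other candidate resolutions per use case, e.g. `points_per_day(3600)` for hourly capacity-planning rollups.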
3) Efficient storage
According to Martin, he and Rob Skillington – Chronosphere’s CTO and co-founder – saw first-hand the importance of efficient storage while they were running the observability team at Uber. Martin explains:
- A lot of observability data is time series data – a measurement of something over a period of time. Relational databases, key-value stores, and blob stores are not efficient ways to store this data, says Martin. “Instead, you need a storage solution that is purpose-built for this type of data. Generally those are time series databases.”
- Martin also cautions that most of the open source solutions, or proprietary vendor solutions, are not backed by time series storage technologies.
4) Data aggregation
Last but not least – data aggregation is, according to Martin, the most effective technique for taming data growth.
Martin explains a common pattern, or problem, among companies he talks to: “They are emitting a ton of data and the data has a lot of dimensions on it.” That makes sense, he explains, because at a certain point you do want to slice-and-dice your data by those dimensions. Martin walks through a latency scenario in which you want to:
- See more than the overall latency of an application.
- Break it down by the latency between your various clusters, data centers, or regions and compare those.
- Break it down by the types of users that are requesting a particular service.
“There are many ways that you want to slice-and-dice data, because when something goes wrong, you want to pinpoint things like: What’s been affected? Or where is something going wrong?”
However, Martin notes, while dimensions, or pivots, offer huge advantages, they also produce a lot of additional data.
- Knowing the views of the data that you want – e.g., knowing you always want the sum of this data – is a good way to optimize for the outcome.
- Only store the final answer – it’s a much smaller subset of the data set and is a lot faster to read. “You’re just reading the pre-computed answer, as opposed to computing the answer each time you go and read the data.”
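The pre-aggregation idea can be sketched in a few lines. The labels (`pod_id`, `region`) and in-memory dictionaries below are hypothetical stand-ins for a real streaming aggregator:

```python
# Sketch: pre-compute per-region latency aggregates and drop the
# high-cardinality pod_id label before storage, so reads hit the
# pre-computed answer instead of recomputing it from raw datapoints.

from collections import defaultdict

# Hypothetical raw datapoints: (pod_id, region, latency_ms)
raw = [
    ("pod-a1", "us-east", 120),
    ("pod-a2", "us-east", 80),
    ("pod-b1", "eu-west", 200),
]

rollup = defaultdict(lambda: {"sum": 0, "count": 0})
for _pod, region, ms in raw:          # pod_id is discarded here
    rollup[region]["sum"] += ms
    rollup[region]["count"] += 1

# Only the aggregate is stored: 2 series instead of one per pod.
print(dict(rollup))
# → {'us-east': {'sum': 200, 'count': 2}, 'eu-west': {'sum': 200, 'count': 1}}
```

The stored series count now scales with the number of regions rather than the number of pods, and a dashboard query for regional latency reads the answer directly.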
The takeaway, says Martin: Understand the use case of your data and optimize it accordingly.
As the data growth problem heads toward a tipping point, companies are going to need to take an outcome-driven approach to observability.