Cloud native architectures are churning out more data, increasing the cost of monitoring tools like Datadog. But there are better ways to manage these expenses.
On: Feb 20, 2024
Rachel leads Product & Solution Marketing for Chronosphere. Previously, she built out product, technical, and channel marketing at CloudHealth (acquired by VMware). Prior to that, she led product marketing for AWS and cloud-integrated storage at NetApp and also spent time as an analyst at Forrester Research covering resiliency, backup, and cloud. Outside of work, she tries to keep up with her young son and hyperactive dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston.
I’ve seen so many X (formerly known as Twitter), Reddit, and Hacker News threads lately discussing the high costs of Datadog. It’s such a hot topic that engineers are posting blogs about their brute-force approaches to dropping metrics.
But how did we get here? Why are these costs so high? Why are companies paying more for their observability than their production infrastructure? There is a lot of finger-pointing and claims of lock-in and corporate greed, which are certainly partly to blame.
There is a bigger underlying issue: The fundamental architecture changes that come with adopting containerized infrastructure and microservices applications. If we don’t understand and address this issue, history will repeat itself.
OK, it’s true, I work for Chronosphere, a company that competes with Datadog. I promise this article will not pitch you on our product. Datadog is a strong competitor, and I’ve watched it build an amazing business for years.
My previous company was a close Datadog partner from 2015-2018, and we watched its meteoric growth, which we desperately wanted to emulate. At the same time, I watched Datadog customers get more and more disgruntled with skyrocketing and unpredictable costs, yet they felt they couldn’t leave.
This was part of what drove me to join Chronosphere in 2021, as I saw this trend coming to a head. Before I joined this space, I did some market sizing and analysis and determined that observability had the biggest attachment to infrastructure spend: For every $1 you spend on public cloud, you’re likely spending 25-35 cents on observability. This struck me as a market ripe for disruption.
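To put that ratio in concrete terms, here is a quick back-of-the-envelope sketch. The $2 million annual cloud bill is a made-up figure for illustration; only the 25-35 cent range comes from my estimate above.

```python
def observability_spend_range(cloud_spend, low_ratio=0.25, high_ratio=0.35):
    """Return the (low, high) estimated observability spend for a cloud bill.

    The default ratios are the 25-35 cents per cloud dollar estimate;
    they are a rough rule of thumb, not a measured constant.
    """
    return cloud_spend * low_ratio, cloud_spend * high_ratio


# Hypothetical annual public cloud bill, for illustration only.
annual_cloud_spend = 2_000_000

low, high = observability_spend_range(annual_cloud_spend)
print(f"Estimated observability spend: ${low:,.0f} to ${high:,.0f} per year")
```

Swap in your own cloud bill to see where your observability spend would land if it followed the same ratio.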
The root cause of the problem is simple: There is a lot more observability data (metrics, logs, traces, and events) than these tools were ever designed to handle. As such, they are neither architected for this data volume nor priced for it. There are multiple reasons we ended up with so much data.
Business drivers:
Technical drivers:
This data growth causes observability spending to skyrocket. Because vendors kept pricing based on legacy monitoring standards, without changing their pricing models or software to account for data growth, cloud native architectures suddenly became shockingly expensive to run.
I suspect there are two reasons for this:
There are a couple of options if you don’t want to pay for Datadog.
One attractive alternative is running your own observability in house with open source tools. The good news is that, at least for metrics and traces, open source tools have come a long way and are coalescing into industry-accepted standards. Prometheus and OpenTelemetry with a variety of time series database backends (Mimir, Thanos, and M3) are a viable alternative to Datadog.
But it’s important to note this typically won’t save you money in real dollars. It’s simply trading CapEx for OpEx. The human and infrastructure cost of running these systems is non-trivial, and if you try to cut corners, you may regret it.
I was talking to a friend recently who moved his company off an expensive commercial SaaS offering to in-house open source tools. He admitted that the company isn’t actually saving any money when it accounts for the fact that around 8% of his developer headcount is now dedicated to running this system.
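That kind of math can be sketched roughly like this. Every number below is hypothetical, chosen only to show how a headcount share like 8% can cancel out the savings; these are not my friend's actual figures, so plug in your own.

```python
def in_house_annual_cost(developers, share_on_observability,
                         loaded_cost_per_developer, infrastructure_cost):
    """Estimate the yearly cost of running observability in house.

    People cost is the share of developer headcount dedicated to the
    system times a fully loaded cost per developer; infrastructure
    (compute, storage, network) is added on top.
    """
    people_cost = developers * share_on_observability * loaded_cost_per_developer
    return people_cost + infrastructure_cost


# All inputs are hypothetical placeholders.
saas_bill = 2_500_000  # previous commercial SaaS bill per year
in_house = in_house_annual_cost(
    developers=150,
    share_on_observability=0.08,        # the "around 8%" of headcount
    loaded_cost_per_developer=200_000,  # salary plus overhead
    infrastructure_cost=300_000,
)

savings = saas_bill - in_house
print(f"In-house cost: ${in_house:,.0f}, savings vs. SaaS: ${savings:,.0f}")
```

With these placeholder inputs the in-house option comes out more expensive than the SaaS bill, which is the point: the people cost is easy to overlook and can dominate the total.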
This is not the part where I pitch Chronosphere. This is where I’ll say that a new generation of tools is being built with data growth as an underlying assumption from the start. With these tools, the cost of the solution stays in the hands of the customer, so you don’t get surprise overages.
Just as Datadog, New Relic, and similar tools displaced the previous generation of SolarWinds, BMC, and CA Technologies, this new generation of observability tooling is starting to make waves. Talk with these vendors and understand how they handle the problem of too much observability data at the source, rather than bandaging over it with better unit economics.
Datadog’s high cost and vendor lock-in have somehow become a necessary evil; you know you need observability, but you’re not sure of all the options. Datadog has been around long enough that it seems like a viable option, despite its billing practices and proprietary code. But it doesn’t have to be this way.
As more observability companies enter the space, so do options that are built to address high-cardinality data growth from the beginning. These options give you more flexibility with your infrastructure, greater control of your data, and more visibility into your monthly bill, and they ultimately set observability teams up for a more sustainable and cost-effective operations model.