Tecton: A Rapidly Scaling Startup Outgrows Prometheus

Tecton is a startup that provides an enterprise-grade feature store for machine learning applications. When the product first launched, the team relied on an out-of-the-box open-source Prometheus setup to monitor the fleet of Tecton instances, some of which are deployed directly in customers' accounts.

The Challenge

Each instance was completely siloed. With the Prometheus setup, it was impossible to get a global view of the entire fleet of Tecton instances. This was the underlying cause of several frustrations. First, during incidents the on-call engineer had to toggle between isolated Prometheus deployments, which increased the time it took to find the metrics needed to address the issue. Second, any change to Prometheus had to be made separately in each deployment. When the Tecton team wanted to silence a noisy alert, they had to go through every Prometheus instance and make the change manually.

Looking for an alternative

The Tecton team had three criteria for a new monitoring solution:

  • A global view across regions and customers
  • Control over data retention and the ability to keep historical data
  • High availability

Tecton ran proofs of concept with a couple of monitoring companies and chose Chronosphere for two reasons. First, the Chronosphere team had a clear commitment to support. “Any of the alternatives we evaluated, I just didn’t have the confidence that we would have the support needed,” Trivedi said. Second was price: Chronosphere was significantly more cost-effective than the alternatives.

Success with Chronosphere

The biggest benefit of moving to Chronosphere is having an out-of-the-box monitoring solution that doesn’t break all the time and has a full feature set that doesn’t need custom engineering work. “We basically don’t think about monitoring anymore as we spin Tecton deployments up and down,” Trivedi said. The days of spending hours urgently debugging the monitoring system are gone, leaving Trivedi and the rest of the team with more time to focus on Tecton’s core product.

Less burdensome on-call rotations. Using Chronosphere has made on-call rotations less of a burden in a couple of ways. First, there is a single point of entry into the monitoring system, which reduces the time it takes to get visibility during an incident and figure out what is happening. Second, the ability to apply silences across the entire fleet of instances has made it easier to tune the alerting system and reduce the number of noisy alerts, which in turn reduces alert fatigue.

Useful data retention. Now that Tecton has more control over data retention, as well as the ability to mix data from multiple places, the team has been able to create custom internal dashboards that pull data from the Chronosphere API and track business intelligence metrics. That wouldn’t be possible with Prometheus.

“It’s way better than it was before,” Trivedi said. Engineers are happier and better equipped to handle their on-call rotations, and they don’t have to waste time building custom monitoring features just to get the visibility they need.
