In 2013, a FinTech company was founded to offer a commission-free investing app that made trading so easy, it unleashed a new generation of investors who could buy and sell stocks without using or paying a broker. Before partnering with Chronosphere, the company ran observability in-house using Grafana Mimir. The solution was not only expensive but also plagued by availability issues and prone to hours-long downtime.
The challenge: Mimir backend caused missed SLAs in availability, reliability and performance
Over the past two years, the company has seen the popularity of its trading platform skyrocket: membership reached tens of millions and month-over-month usage increased by 80%. Its users, who execute trades throughout the day and tee them up the night before, are so passionate about the freedom to buy and sell stocks that they expect the same uptime as they do from essential utilities: electricity, water, and the ability to trade should be always-on. “When money and regulatory bodies are involved, the reliability stakes are even higher – we needed to eliminate all barriers for customers to trade on our platform,” said a Senior Staff Engineer who also founded the observability practice.
For the company, the challenge became meeting reliability and performance demands from its rapidly growing user base. The company was spending millions of dollars per year on its open-source monitoring product with a Mimir backend. Yet the company was plagued by outages, especially at the “Sev1” level, meaning the production system has stopped operating and there is no workaround. There were several issues and outcomes related to outages:
- Monitoring system availability SLAs weren’t being met: The monitoring system was only achieving “one nine” of uptime (i.e., it was less than 99% available).
- It was harder to troubleshoot customer issues: Engineers were missing critical alerts and MTTD (Mean Time to Detection) was getting longer instead of shorter.
- Performance and retention needs weren’t being met: Queries and dashboards on data more than 12 hours old would fail due to timeouts. In addition, the company could only store 2 weeks of metrics data in Mimir, which put compliance with industry standards at risk.
- High operational cost: Mimir is composed of several different components, each of which requires specialized hardware. That hardware struggled to keep up with the company’s metrics volume without significant investment in the infrastructure.
The resulting downtime was a huge disruption for enthusiastic day traders. “When Mimir went down, it was several hours before it came back up,” said the Senior Staff Engineer. “We can’t win over customer trust with a system that doesn’t offer high availability, durability and performance.”
The company required a solution that was not overly complicated, was cost-effective, guaranteed at least 99.9% uptime of observability services, and improved dashboard loading speeds. “Availability guarantees are essential. Cost was also a factor since we had been spending several millions on our previous monitoring product,” said the Senior Staff Engineer.
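To make the gap between “one nine” and the 99.9% target concrete, the following is a minimal sketch that converts an availability percentage into a yearly downtime budget. The helper name `downtime_hours_per_year` is hypothetical, and the calculation assumes a 365-day year and that availability is measured over the full year; actual SLA terms may define the measurement window differently.

```python
# Illustrative arithmetic only; the function name and the 365-day year
# are assumptions, not part of any vendor SLA definition.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the maximum hours of downtime per year at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# "One nine" sits between 90% and 99%; 99.9% is the stated target.
for pct in (90.0, 99.0, 99.9):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available -> up to {hours:.2f} hours of downtime per year")
```

Under these assumptions, a system below 99% availability can be down for more than 87 hours a year, while a 99.9% guarantee allows under 9 hours.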
In search of a solution
The company began the search for a highly available solution that could scale alongside their business and was compatible with open source standards. Well-known SaaS application and infrastructure monitoring products had been evaluated and ruled out due to cost, reliability, performance, and vendor lock-in, especially where it concerned dependency on the vendor for custom integrations.
While the company briefly considered running observability in-house with a different tech stack, they ultimately decided SaaS was a better approach for the business. “SaaS frees up engineers from the on-call onus. You’re not playing ‘whack-a-mole’ with services and with underlying structure. You’re on the applications on top of your metrics, improving libraries, metrics adoption… it frees you up to focus on the bigger picture,” said the Senior Staff Engineer.
The FinTech company chose Chronosphere to solve reliability and performance requirements
Chronosphere quickly rose to the top of the list, with several capabilities hitting the company’s requirements head-on:
High availability and reliability: The Chronosphere observability platform was built from the ground up for cloud-native scale and complexity, which means greater reliability. Chronosphere is 5x more reliable than alternative SaaS monitoring solutions and has never missed a customer SLA, a fact that was vitally important to the company given its previous challenges with availability.
Fast remediation: The faster the engineering teams know there is a problem, the faster they can start to remediate it. With Chronosphere, the company reduced its MTTD by 4x, from 2 minutes to 30 seconds. Once engineers are alerted, they can also load dashboards and reports much faster: dashboards that previously took 15 minutes to load now load in seconds. On top of that, Chronosphere reduced “time to glass” (the time from when a data point is generated to when it is visible in dashboards and reports) from 45 seconds to 5 seconds.
Compatible with open-source standards: Unlike other SaaS monitoring offerings, Chronosphere is open-source compatible, supporting all major open source metrics ingest protocols, dashboards, and query languages. Because Chronosphere is built on the open source M3 metrics engine, the company did not need to rely on the black-box magic of a monitoring vendor’s proprietary data format, avoided vendor lock-in, and was able to leverage its existing Prometheus investments.
Cloud-native observability expertise: Chronosphere’s co-founders Martin Mao (CEO) and Rob Skillington (CTO) previously ran the observability team at Uber, where they experienced first-hand the challenges of running large-scale observability for cloud-native environments.
Ability to keep up with the business: In Chronosphere, the company found an observability partner that could keep up with its rapidly scaling business. Longer retention periods and faster load times let the company unlock insights that were previously unavailable. “After seven days, data isn’t usable. The fact that we now see data retention in excess of two years with Chronosphere is huge for us.”
Results at a glance
- Improved availability to at least 99.9%
- Improved query latency by more than 8x
- Improved “time to glass” by 9x
- Improved MTTD by 4x
- Long-term retention increased from 2 weeks to 13 months
- Reduced engineering time spent on managing monitoring solutions by one third
- Improved scalability by more than 3x
- Eliminated all Sev 0 and Sev 1 incidents, representing 75% of total “critical” incidents
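Two of the factors listed above can be checked against the before-and-after figures quoted in the text: MTTD (2 minutes to 30 seconds) and “time to glass” (45 seconds to 5 seconds). The sketch below shows that arithmetic; the function name `improvement_factor` is hypothetical. The other results do not have both figures stated in this case study, so they are not recomputed here.

```python
# Illustrative arithmetic only; the helper name is an assumption.
def improvement_factor(before_seconds: float, after_seconds: float) -> float:
    """Return how many times smaller the 'after' value is than the 'before' value."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


# MTTD: 2 minutes (120 s) down to 30 s.
mttd_factor = improvement_factor(120, 30)
# Time to glass: 45 s down to 5 s.
time_to_glass_factor = improvement_factor(45, 5)

print(f"MTTD improved by {mttd_factor:.0f}x")
print(f"Time to glass improved by {time_to_glass_factor:.0f}x")
```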