In 2013, a FinTech company was founded with one mission in mind: To offer a commission-free investing app that democratized trading and finance. The company unleashed a new generation of investors who could buy and sell stocks – all without a broker.
Prior to partnering with Chronosphere, the company ran in-house observability with Grafana Mimir. The team faced high costs, was plagued by availability issues, and endured downtime episodes that sometimes lasted hours.
The challenge: availability, reliability, and performance
The high-growth FinTech company saw the popularity of its trading platform skyrocket over the course of two years. Membership reached tens of millions, and month-over-month usage grew by up to 80%.
Because of their rapidly growing user base, the biggest challenge for the organization became meeting reliability and performance demands.
Their setup’s issues included:
- Missing monitoring system availability SLAs
  - The monitoring system was only achieving “one nine” of uptime.
  - Customer issues were harder to troubleshoot.
  - Engineers were missing critical alerts.
  - Mean time to detection (MTTD) was getting longer instead of shorter.
- Unmet performance and retention needs
  - Queries and dashboards spanning more than 12 hours of data failed due to timeouts.
  - The company could only store 2 weeks of metrics data in Mimir, putting compliance with industry standards at risk.
- Accruing high operational costs
  - Mimir’s components required specialized hardware that struggled to keep up with metrics volume.
“When Mimir went down, it was several hours before it came back up … We can’t win over customer trust with a system that doesn’t offer high availability, durability, and performance.” – Senior Staff Engineer
Downtime became a customer-facing disruption for enthusiastic day traders. Despite millions of dollars spent per year on their open-source monitoring product with a Mimir backend, the team was still plagued by major outages — at the Sev1 level.
The search for a new, reliable solution
The company needed a solution that was cost-effective, easy to use, and guaranteed at least 99.9% uptime for observability services, along with faster dashboard loading and compatibility with open source standards.
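To put the gap between the old system and the new target in concrete terms, the short sketch below converts an availability percentage into the downtime it permits per year. It assumes that “one nine” means 90% availability (the usual reading of the term) and uses a 365-day year; the function and variable names are illustrative and do not come from the company or from Chronosphere.

```python
# Illustrative arithmetic only: converts an availability percentage into
# the amount of downtime it allows over one year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Return the hours of downtime per year permitted at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# Assumption: "one nine" is read as 90% availability.
one_nine = allowed_downtime_hours(90.0)
# The 99.9% target named in the requirements ("three nines").
three_nines = allowed_downtime_hours(99.9)

print(f"90%   availability: {one_nine:.0f} hours of downtime per year")
print(f"99.9% availability: {three_nines:.2f} hours of downtime per year")
```

Under these assumptions, 90% availability allows roughly 876 hours of downtime per year, while 99.9% allows about 8.76 hours.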
While the company considered running observability in-house with a different tech stack, they ultimately decided that SaaS was a better business approach. Throughout the search for a new solution, they ruled out well-known SaaS application and infrastructure monitoring products due to issues with cost, reliability, performance, and vendor lock-in.
Solve reliability and performance requirements with Chronosphere
As the company learned more about Chronosphere, it found several capabilities that met its requirements:
- High availability and reliability
  - Chronosphere’s observability platform was built for cloud native scale and complexity.
  - Chronosphere is 5x more reliable than alternative SaaS monitoring solutions.
- Fast remediation
  - With Chronosphere, the company reduced its MTTD by 4x, from 2 minutes to 30 seconds.
  - Chronosphere reduced “time to glass” (the time from when a data point is generated to when it is visible in dashboards and reports) from 45 seconds to 5 seconds.
- Compatible with open source standards
  - Chronosphere is compatible with open source standards. The solution is built on the open source M3 metrics engine, so the company avoided vendor lock-in and leveraged existing Prometheus investments.
- Cloud native observability expertise
  - Chronosphere’s co-founders previously ran the observability team at Uber and experienced large-scale observability challenges first-hand, so they understood the difficulties that come with rapidly scaling infrastructure.
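The improvement factors quoted for MTTD and “time to glass” follow directly from the before-and-after figures. A minimal sketch of that arithmetic, using a hypothetical helper name and only the numbers stated in this case study:

```python
def improvement_factor(before_seconds: float, after_seconds: float) -> float:
    """Return how many times smaller the 'after' value is than the 'before' value."""
    return before_seconds / after_seconds


# MTTD: from 2 minutes (120 seconds) to 30 seconds.
mttd_factor = improvement_factor(120, 30)

# Time to glass: from 45 seconds to 5 seconds.
time_to_glass_factor = improvement_factor(45, 5)

print(f"MTTD improvement: {mttd_factor:.0f}x")
print(f"Time-to-glass improvement: {time_to_glass_factor:.0f}x")
```

This gives 4x for MTTD and 9x for time to glass.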
With Chronosphere, the company found an observability partner who could keep up with their rapidly scaling business. The company’s observability team gained the ability to unlock new insights and examine data over retention periods that were previously unavailable.
Chronosphere’s cloud native observability platform provided the FinTech company with:
- Improved availability to at least 99.9%
- Query latency improved by more than 8x
- 9x improvement in “time to glass”
- 4x improvement in MTTD
- Long-term data retention increased from 2 weeks to 13 months
- Engineering time spent managing monitoring solutions reduced by 33%
- Scalability improved by more than 3x
- Eliminated all Sev 0 and Sev 1 incidents, representing 75% of total critical incidents.
Now, the FinTech company no longer has to worry about critical outages and hours-long downtime that reaches the customer. The team has the power and support to proactively solve problems with rapid turnaround.