Snap, Inc. is a technology company that believes a camera lens is the most powerful way to improve how people share, communicate, and live. Serving over 740 million daily customers across the globe with Snapchat, Spectacles, and Bitmoji, Snap helps people share memorable experiences with friends in an online social network.

Before partnering with Chronosphere, Snap’s customer base was scaling rapidly, and their in-house, open source observability solution was becoming too expensive and time-consuming for the engineering team to manage.

Snap’s observability setup didn’t meet expectations:

  • Their legacy Graphite system faced stability issues and was failing. Each time the system crashed, engineers had to manually step in and bring systems back online. 
  • Their system had performance issues: dashboards and queries were slow to load, which made it difficult to respond to customer-facing issues.
  • The observability system couldn’t keep up with the volume of data ingested each day. Snap could only ingest and retain a fraction of the metrics needed for troubleshooting.
  • Their self-managed Graphite instances were difficult to use, time-consuming to manage, and negatively affected developer experience. 
  • As much as 80-95% of metrics came from real-time services deployed via Amazon Elastic Container Service (ECS). 
  • Their Prometheus instance ran on Amazon EC2 R5 (r5.24xlarge with 768 GB of memory), one of the most expensive and memory-intensive instance types.

Before and after partnering with Chronosphere

The challenge: operational load and developer burnout

With their legacy system, it was nearly impossible for Snap’s engineering and IT teams to achieve their desired availability levels. The challenges included:

  • Availability: Random latency issues slowed the system and kept dashboards from loading.
  • Scalability: The system couldn’t keep up with anything beyond small volumes of emitted metrics.
  • Usability: The legacy system was hard for new users to adopt, which hurt triage efforts and mean time to repair (MTTR).
  • Operational load: On-call engineers received an average of 30-40 pages per week.

Alongside a system prone to crashing, the team at Snap was dealing with rising costs and reduced developer productivity, which held back Snap’s digital transformation initiatives. So the team set out to find a new, purpose-built observability solution that could keep up as they scaled and protect their developer experience.

The solution: improved developer productivity and cost efficiency

Evan Yin, Technical Lead at Snap, said that he discovered Chronosphere’s technology while it was still a GitHub page for M3.

After seeing how Chronosphere helps teams regain control over their observability data volumes and achieve high availability at scale in cloud native architectures, Yin knew Snap had found the right fit. He also felt that Snap’s observability values aligned well with Chronosphere’s company vision.

Chronosphere’s cloud native observability platform provided Snap with:

  • Reduced costs and hours spent: With Chronosphere’s Control Plane, the team reduced data volumes by more than half and saved thousands of engineering hours.  
  • Support for cloud native architecture at scale: Snap was able to scale their observability solution 5x, from 50 million time series to 250+ million time series. 
  • Improved observability reliability: With Chronosphere’s industry-leading 99.9% uptime service-level agreement (SLA), Snap engineers could reliably access the dashboards and metrics they needed.
  • Developer productivity: With faster-loading dashboards and queries, engineers spent less time worrying about metrics and more time working on value-added tools.

Outcome and key results: smoother operations and higher reliability

After partnering with Chronosphere, the engineering team at Snap achieved the following key results:

  • Greater uptime of the observability stack, with the SLA met consistently since implementation.
  • Reduced data volumes by more than 50%, leading to significant cost savings.
  • Reliable dashboard loading and views of high-cardinality metrics.
  • A 90% decrease in on-call pages for the central observability team.
  • 89% data optimization (cost reduction and performance).
  • Successful cAdvisor deployment.
  • Tenant availability of four nines (99.99%).

By partnering with a purpose-built, future-proof observability provider, the engineering team at Snap now has the support they need to work the way they want.

Experience the difference: Learn how Chronosphere’s solution can benefit you

To learn more about how moving to Chronosphere can help engineering teams improve their developer experience and successfully scale cloud native environments, check out the full Snap case study.