Abnormal Security cuts observability costs with Chronosphere

AI-based threat detection company Abnormal Security needed a solution to help with their growing observability costs. Read how Chronosphere helped.

Chronosphere Staff | Chronosphere

Abnormal Security is on a mission to protect organizations from sophisticated email attacks, especially those targeting enterprise legacy systems. The company’s AI-based threat detection engine distinguishes between normal and abnormal behavior to autonomously prevent personalized, socially-engineered email attacks.

Abnormal’s customer base had grown rapidly since its 2018 launch, and so had its metrics. When Abnormal partnered with Chronosphere, its homegrown Prometheus + Grafana monitoring system, which was responsible for scraping all endpoints and consolidating all data, had reached its limits.

Abnormal’s growth in data metrics

  • The company’s 10-12 million active metrics were on pace to soar to 50 million.
  • As much as 80%-95% of metrics came from real-time services deployed via Amazon Elastic Container Service (ECS).
  • The Prometheus instance itself ran on an Amazon EC2 R5 instance (r5.24xlarge, with 768 GiB of memory), one of the most expensive and memory-intensive instance types.

The challenge – data growth and outages

Abnormal was drowning in a sea of observability data and constant metric outages. The effects of this unmanageable growth included:

  • Prometheus wasn’t highly available because it was scaled vertically (one ever-larger instance) rather than horizontally. Any disruption to the EC2 instance caused multiple downstream issues.
  • Increased Mean Time to Detection (MTTD) of critical issues.
  • Limited retention period of two days due to both the administration and storage costs of Prometheus.
  • Slow-loading dashboards – queries over time ranges greater than 30 minutes loaded slowly or wouldn’t load at all.
  • Engineers accidentally caused Prometheus to crash by deploying new services or adding new time series, leaving the team flying blind and scrambling to troubleshoot the cause.

The infrastructure team battled constant metrics outages alongside resource limitations, which impacted triaging and management. Abnormal set out to find an observability solution that could keep up with the more than 300% business growth it had experienced over the past year while meeting a targeted 99.9% SLA.
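To put the 99.9% target in perspective, the short sketch below converts an availability percentage into a downtime budget. It is plain arithmetic, not anything specific to Abnormal or Chronosphere; the function name and the 30-day and 365-day periods are assumptions chosen for illustration.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float) -> float:
    """Return the minutes of downtime allowed in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Illustrative periods (assumptions): a 30-day month and a 365-day year.
monthly = downtime_budget_minutes(99.9, 30)
yearly = downtime_budget_minutes(99.9, 365)

print(f"99.9% allows about {monthly:.1f} minutes of downtime per 30 days")
print(f"99.9% allows about {yearly / 60:.2f} hours of downtime per 365 days")
```

At this target the budget is roughly 43 minutes per month, so a monitoring system that crashes whenever a new service is deployed can use it up quickly.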

Solution – cost savings a priority

Cost savings, in both engineering time and infrastructure, were a key reason Abnormal chose Chronosphere for observability. The company had already ruled out several monitoring alternatives, including running Thanos in-house or adopting another SaaS solution such as Grafana Labs.

Chronosphere’s cloud native observability platform provided Abnormal with:

  • Visibility into, and control over, how the observability system is being used.
  • Flexibility in metric retention by offering the ability to choose both the time interval and retention time.
  • Data aggregation – Chronosphere’s unique control plane allowed Abnormal to aggregate 98% of its metrics, making it 10x more cost-effective than alternative SaaS and self-managed options.
  • Reduced management overhead – engineers and admins were freed up to work on problems that drive the business.
  • Open source compatibility – Chronosphere can natively ingest Prometheus metrics, so Abnormal didn’t have to change any instrumentation.
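The general idea behind the aggregation benefit above can be sketched in a few lines: dropping a high-cardinality label and summing the affected series leaves fewer series to store. This is a conceptual illustration only, not Chronosphere's control plane or API; the `aggregate` helper, the label names (`service`, `task_id`), and the sample values are all hypothetical.

```python
from collections import defaultdict


def aggregate(samples, drop_labels):
    """Sum sample values across series after removing the given labels."""
    rolled = defaultdict(float)
    for labels, value in samples:
        # Series identity is the set of remaining label pairs.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        rolled[key] += value
    return dict(rolled)


# Hypothetical request counts, one series per container task.
samples = [
    ({"service": "scanner", "task_id": "a1"}, 3.0),
    ({"service": "scanner", "task_id": "b2"}, 5.0),
    ({"service": "api", "task_id": "c3"}, 2.0),
    ({"service": "api", "task_id": "d4"}, 4.0),
]

rolled = aggregate(samples, drop_labels={"task_id"})
print(f"{len(samples)} series -> {len(rolled)} series")
for key, value in rolled.items():
    print(dict(key), value)
```

Here four per-task series collapse into two per-service series. With short-lived containers the per-task label would otherwise create a new series for every deployment.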

Outcome and key results — cost reduction, availability, reliability

By adopting a scalable, flexible observability platform that helped cut observability costs, the Abnormal team gained monumental value for their team, infrastructure, and customer experience:

  • Improved overall Prometheus stability.
  • Clear, predictable scaling of resources using the Chronosphere collector.
  • Reduced MTTD and MTTR by 80% based on SLOs.
  • Improved query performance — Abnormal is able to load dashboards 8-10x faster.
  • Improved reliability and stability to greater than 99.9% uptime.
  • Increased ability to understand problems that can be addressed by high cardinality metrics.
  • Empowered the SRE team to shift focus from monitoring to tackling problems that move the needle for the business.

Experience the Difference: Learn How Our Solution Can Benefit You

You can explore more on how making the move to Chronosphere has alleviated critical pain points for Abnormal’s engineering team in this Abnormal case study.
