Prometheus: Availability and Reliability


We explore Prometheus high availability and reliability as the first of four key requirements for a Prometheus-native solution.

 

Chronosphere staff

The world of monitoring has fundamentally changed. Today’s monitoring tools were not designed for the complex, dynamic, and interconnected nature of cloud-native architecture. Companies need a monitoring solution that is as scalable, reliable, and flexible as the cloud-native apps it monitors. They need a Prometheus solution with high availability.

Prometheus’ single-binary implementation for ingestion, storage, and querying makes it ideal as a lightweight metrics and monitoring solution with quick time to value, which is perfect for cloud-native environments. But simplicity and ease have their trade-offs: as organizations scale up their infrastructure footprint and number of microservices, they need to stand up multiple Prometheus instances. That is when high availability and data locality issues start to appear, and all of them require significant management overhead.

Why Prometheus-native and what does it mean?

Organizations are exploring a move to a hosted or managed metrics and monitoring SaaS offering. To ensure a smooth transition and avoid future lock-in, it’s critical that the SaaS monitoring solution be fully Prometheus-native. There are specific capabilities you should look for in a Prometheus-native monitoring solution, including:

  • Prometheus ingestion protocol support.
  • 100% PromQL compatibility.
  • Prometheus Alertmanager definition support.
  • Prometheus Recording Rule support.
  • Grafana dashboard support.
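One rough way to spot-check the PromQL item on this list is to send the same query to your existing Prometheus and to the candidate backend, then compare the results. The sketch below assumes the backend exposes the standard Prometheus HTTP API instant-query endpoint (/api/v1/query). The host name and the helper names instant_query_url and parse_instant_vector are hypothetical, and a handful of matching queries is only a smoke test, not proof of 100% compatibility.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict:
    """Turn an instant-vector response body into {sorted label pairs: value}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "value"]}.
    return {
        tuple(sorted(sample["metric"].items())): float(sample["value"][1])
        for sample in data["result"]
    }


# Hypothetical host; fetch this URL from both backends and compare the parsed results.
print(instant_query_url("https://metrics.example.com", 'up{job="api"}'))
```

Parsing both responses into the same dictionary shape makes the comparison a plain equality check on label sets and values.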

With our working definition of Prometheus-native basics in hand, we can dive into a discussion about requirements for the monitoring solution itself.

Prometheus-native monitoring: 4 must-haves

This four-part blog series explores the key capabilities that organizations should consider when selecting a Prometheus-native monitoring solution, covering:

  1. Prometheus High Availability and Resiliency (current blog)
  2. Cost and Control
  3. Security and Administration
  4. Performance and Scale

Prometheus high availability and resiliency requirements

Let’s start with a primary reason you’re investing in cloud monitoring in the first place: ensuring the performance of your cloud-native apps. To ensure your infrastructure, applications, and services are performing optimally, you must have highly available and extremely resilient monitoring. For this reason, your Prometheus monitoring should have higher availability than even your critical production apps. Here are some important requirements you should consider.

Prometheus high availability and reliability

Under the strain of increased data volume and cardinality, monitoring systems become unreliable: they lose data or experience downtime. Without monitoring, teams are flying blind and won’t be able to respond to issues in real time. To make sure this doesn’t happen to you, pay close attention to the availability and reliability of your SaaS vendor’s Prometheus offering.

Service level agreements (SLAs)

When discussing SLAs, most people jump straight to “How many nines of uptime does the vendor offer?” This is a critical question because you’ll want a monitoring solution that is more highly available than your production environment, which ideally means looking for 99.9% uptime.
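To make the “nines” concrete, it helps to convert an availability percentage into the downtime it permits. The sketch below assumes a 30-day measurement window, which is an illustrative choice; vendors define their own SLA periods. The function name downtime_budget_minutes is hypothetical.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime an availability target allows over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


# Compare a few common targets over the assumed 30-day window.
for target in (99.0, 99.9, 99.99):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% allows about {minutes:.1f} minutes of downtime in 30 days")
```

Under that assumption, 99.9% works out to roughly 43 minutes of allowed downtime per 30 days, and each additional nine cuts the budget by a factor of ten.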

Beyond the number itself, it’s important to look at how the vendor defines and monitors the SLA and at what point they notify customers. A best-in-class vendor will proactively monitor its own systems, count any period of more than a few minutes in which the system is inaccessible as downtime, and notify customers immediately.
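How “downtime” is counted changes the reported number, so it is worth seeing what such a definition looks like in practice. The sketch below assumes a series of periodic probe results and a three-minute minimum outage length. Both the threshold and the function name counted_downtime_seconds are illustrative assumptions, not any vendor’s actual SLA terms.

```python
from typing import List, Tuple


def counted_downtime_seconds(
    probes: List[Tuple[float, bool]],
    min_outage_seconds: float = 180.0,
) -> float:
    """Sum the outages that last at least min_outage_seconds.

    probes holds (unix_timestamp, succeeded) pairs sorted by time. An outage
    runs from the first failed probe to the next successful one. An outage
    that is still open at the end of the list is not counted.
    """
    total = 0.0
    outage_start = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # first failure opens an outage
        elif succeeded and outage_start is not None:
            duration = timestamp - outage_start
            if duration >= min_outage_seconds:
                total += duration  # only long outages count against the SLA
            outage_start = None
    return total


# One four-minute outage (counted) and one one-minute blip (ignored).
probes = [(0, True), (60, False), (120, False), (300, True), (360, False), (420, True)]
print(counted_downtime_seconds(probes))
```

A higher threshold hides short blips from the SLA report, which is exactly why the vendor’s definition deserves as much scrutiny as the headline percentage.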

SLA checking 

Proactive SLA checking is a great start, but a simple ping check against an endpoint does not tell you much about whether the system is performing as expected. It only tells you that the endpoint is returning HTTP 200 responses. A proper SLA check for a hosted monitoring solution should exercise the basic read and write paths to confirm that data is persisted as expected and none is lost.
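A write-then-read check of this kind can be sketched in a few lines. The version below is a minimal illustration that takes two caller-supplied functions, write_sample and read_sample, which are hypothetical stand-ins for however you send a sample to and query a sample from your monitoring backend. The metric name sla_canary and the timeout values are also assumptions.

```python
import time
from typing import Callable, Optional


def check_write_read_path(
    write_sample: Callable[[str, float, float], None],
    read_sample: Callable[[str, float], Optional[float]],
    metric_name: str = "sla_canary",
    timeout_seconds: float = 30.0,
    poll_interval_seconds: float = 1.0,
) -> bool:
    """Write one canary sample, then poll until it can be read back.

    Returns True if the sample is read back with the same value before the
    timeout, and False if it never appears.
    """
    timestamp = time.time()
    value = timestamp  # use the timestamp as the value to distinguish this sample
    write_sample(metric_name, timestamp, value)

    deadline = time.monotonic() + timeout_seconds
    while True:
        if read_sample(metric_name, timestamp) == value:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)


# Usage with an in-memory stand-in for a real backend.
store = {}
ok = check_write_read_path(
    lambda name, ts, value: store.__setitem__((name, ts), value),
    lambda name, ts: store.get((name, ts)),
)
print("write and read paths healthy:", ok)
```

Unlike a ping, this check fails when writes are accepted but silently dropped, or when reads are served but lag behind ingestion for longer than the timeout.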

Dedicated endpoint 

One of the leading causes of downtime in SaaS monitoring solutions is noisy neighbors: one tenant in a multi-tenant SaaS environment degrades performance for the other tenants. To avoid this, you would ideally want a dedicated environment with dedicated endpoints that are not shared with other customers.

Cloud provider choice and circular dependency protection

Most engineers are familiar with the concept of a circular dependency (A depends on B, but B also depends on A), but few think about it in the context of their SaaS monitoring service and their production environment.

If your production applications are hosted in the same cloud provider and region as your SaaS monitoring solution, you’ve got a circular dependency. The primary risk is that a service disruption in that region could take down your production environment and your monitoring solution at the same time. That is the worst possible moment to lose monitoring, because it is the system meant to inform you of the production outage. Best practice is to host your SaaS monitoring solution in a different region and, if possible, with a different IaaS cloud provider to avoid this circular dependency.
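This check is simple enough to automate as part of a vendor review or an infrastructure audit. The sketch below compares where production and monitoring are hosted and reports how much of the failure domain they share. The Deployment type, the function name, and the provider and region strings are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    provider: str  # for example "aws" or "gcp"
    region: str    # for example "us-east-1"


def shared_failure_domain(production: Deployment, monitoring: Deployment) -> str:
    """Describe how much of the hosting footprint production and monitoring share."""
    same_provider = production.provider.lower() == monitoring.provider.lower()
    same_region = production.region.lower() == monitoring.region.lower()
    if same_provider and same_region:
        return "same provider and region"  # a regional outage can take out both
    if same_provider:
        return "same provider, different region"  # survives a regional outage
    return "different provider"  # survives a provider-wide outage


production = Deployment("aws", "us-east-1")
for monitoring in (
    Deployment("aws", "us-east-1"),
    Deployment("aws", "eu-west-1"),
    Deployment("gcp", "us-central1"),
):
    print(monitoring, "->", shared_failure_domain(production, monitoring))
```

The first result is the circular dependency described above, the second follows the minimum recommendation, and the third is the preferred setup when a second provider is an option.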

To learn more, read the full ebook, Cloud Native Observability Practical Challenges and Solutions for Modern Architectures.
