Not all monitoring solutions are created equal, so it’s important to know what to look for as you evaluate different approaches to monitoring cloud-native apps. In this series, we explore why a Prometheus-native monitoring solution is best, explain what qualifies as Prometheus-native, and define the four key capabilities your monitoring vendor should offer. We break it all down, covering:
- Availability & Reliability
- Cost & Control
- Security & Administration
- Performance & Scale (current blog)
So far, we’ve discussed availability & reliability, cost & control, and security & administration, and we’ve defined what Prometheus-native means. In this fourth and final installment, we explore which performance & scale features are must-haves in your next Prometheus-native monitoring solution.
Real-time monitoring means performance and scale
As you transition to cloud-native, your monitoring solution must scale reliably while continuing to deliver high performance. Under the strain of increased metrics data volume, higher cardinality, and scaling workloads, some monitoring systems struggle to ingest all the data and make it available in real time. Without real-time monitoring, teams might not find out about issues right away, which can prolong customer impact.
1) Proven ability to scale
Many organizations have found that as they scale their cloud-native environment, their SaaS monitoring vendor can’t keep up. To be sure you are picking a solution that meets your scale needs today and in the future, you’ll want to ask your vendor to provide examples of other customers of a similar size and growth rate, and ideally speak with these customers as well.
2) Speed of alerting
When something goes wrong, you want to know as soon as possible so you can begin remediating the issue, and that means getting an alert from your monitoring system quickly. SaaS vendors differ widely here, so ask how frequently alert conditions are evaluated, with the goal of finding a solution that notifies users of an issue within a few seconds.
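To make the question of alert frequency concrete, here is a rough sketch of how check intervals add up to notification delay in a Prometheus-style pipeline. It assumes a simplified model in which the worst case is the sum of the scrape interval, the rule evaluation interval, the time a condition must hold before firing, and any batching wait before notifications are sent. The class, the function, and the example values are illustrative assumptions, not any vendor's actual settings.

```python
from dataclasses import dataclass


@dataclass
class AlertPipeline:
    """Hypothetical timing settings for one alerting pipeline, in seconds."""
    scrape_interval: float      # how often metrics are collected
    evaluation_interval: float  # how often alert conditions are checked
    for_duration: float         # how long a condition must hold before firing
    group_wait: float           # how long notifications are batched before sending


def worst_case_delay(p: AlertPipeline) -> float:
    """Simplified upper bound on the time from fault to notification.

    Assumes each stage waits a full interval in the worst case and ignores
    network and processing time.
    """
    return p.scrape_interval + p.evaluation_interval + p.for_duration + p.group_wait


# Two made-up configurations for comparison.
fast = AlertPipeline(scrape_interval=1, evaluation_interval=1, for_duration=0, group_wait=0)
slow = AlertPipeline(scrape_interval=60, evaluation_interval=60, for_duration=300, group_wait=30)

print(f"fast pipeline: up to {worst_case_delay(fast):.0f}s")
print(f"slow pipeline: up to {worst_case_delay(slow):.0f}s")
```

Under these assumptions, the gap between the two configurations is minutes rather than seconds, which is why it is worth asking a vendor for each of these intervals and not only a headline number.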
3) Alert generation and management
To get up and running faster, look for automated alert setup. Since most services or hosts produce the same metrics data, automated or templatized alerting lets engineers learn about problems without a complicated setup process. Some tooling also groups or correlates alerts so engineers aren’t overwhelmed by alert storms and can quickly get the information they need to start remediating the issue.
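As a minimal sketch of what templatized alerting and alert grouping can look like, the example below stamps out one rule per service from a shared template and then groups firing alerts by a common label. The template, the PromQL-style expression, the metric name, the threshold, and the `cluster` label are all placeholders chosen for illustration; real tooling will have its own rule format and grouping options.

```python
from collections import defaultdict

# Hypothetical rule template; the expression and threshold are placeholders,
# not a recommended alert.
RULE_TEMPLATE = {
    "alert": "HighErrorRate_{service}",
    "expr": 'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[5m])) > {threshold}',
    "labels": {"severity": "page"},
}


def render_rules(services, threshold=5):
    """Stamp out one alert rule per service from the shared template."""
    rules = []
    for service in services:
        rules.append({
            "alert": RULE_TEMPLATE["alert"].format(service=service),
            "expr": RULE_TEMPLATE["expr"].format(service=service, threshold=threshold),
            "labels": {**RULE_TEMPLATE["labels"], "service": service},
        })
    return rules


def group_alerts(firing, key="cluster"):
    """Group firing alerts by a shared label so one notification covers many."""
    groups = defaultdict(list)
    for alert in firing:
        groups[alert["labels"].get(key, "unknown")].append(alert["alert"])
    return dict(groups)


rules = render_rules(["checkout", "search"])
for rule in rules:
    print(rule["alert"], "->", rule["expr"])

# Made-up firing alerts: three alerts collapse into two notifications.
firing = [
    {"alert": "HighErrorRate_checkout", "labels": {"cluster": "us-east"}},
    {"alert": "HighErrorRate_search", "labels": {"cluster": "us-east"}},
    {"alert": "HighErrorRate_search", "labels": {"cluster": "eu-west"}},
]
grouped = group_alerts(firing)
print(grouped)
```

Adding a new service then means adding one name to a list, and an incident that affects a whole cluster produces one grouped notification instead of one page per service.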
Support and dependability: Going beyond product capabilities
Functional requirements are the core focus area for buyers. At the same time, services and support cannot be overlooked. This area can often be a game changer, as many organizations require strict SLAs (service level agreements) or need a dedicated team to help them ensure success.
Look for a vendor who is a trusted partner: one that can bring expertise to the table and has the skills to deal with the unexpected. Ask yourself questions that will tell you what services, beyond product capabilities, the vendor provides, such as:
- Can they onboard my historical metrics data?
- What metrics data formats do they support?
- Can they provide assistance setting up alerts and dashboards?
- Will they do enablement sessions with users and administrators?
Here are three powerful screening tips to rely on when evaluating a Prometheus-native monitoring vendor.
1) Does your vendor have a strong SLA track record?
Actions speak louder than words. It is more important to know your SaaS vendor’s historical performance on SLAs than what they are promising for the future. Remember, too, that SLA definitions vary from company to company, so make sure you understand the full context of how each vendor defines an outage when you make side-by-side comparisons. Ask not only for overall customer SLA performance, but also for the SLA performance of their biggest (and likely most demanding) customers.
| What is the vendor’s uptime guarantee? | 99.9% guaranteed uptime (0.74 hrs / month of downtime) | 99.5% guaranteed uptime (3.6 hrs / month of downtime) |
| What is the vendor’s actual delivered SLA over the past 12 months? | 99.95% actual uptime over the past 12 months | Does not disclose actual uptime over the past 12 months |
| How does the vendor check that systems are available? | Checks that read and write paths are available to ensure that data is persisting as expected and no data is lost | Simple endpoint ping |
| What is the vendor’s definition of downtime? | Non-availability of the Chronosphere Covered Services for a duration of more than 2 minutes | The time stamp of the alert in the vendor’s monitoring systems, or the time stamp of the customer’s ticket |
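When comparing uptime guarantees, it can help to convert percentages into a monthly downtime budget and to see how a vendor's definition of downtime changes what counts against it. The sketch below does both. The function names, the 30-day default, and the outage log are assumptions for illustration; the 2-minute minimum mirrors the example definition in the table, and other vendors may count differently.

```python
def allowed_downtime_hours(uptime_percent: float, days_in_month: int = 30) -> float:
    """Hours of downtime per month permitted by an uptime guarantee."""
    hours_in_month = days_in_month * 24
    return hours_in_month * (100 - uptime_percent) / 100


def counted_downtime_minutes(outages_minutes, minimum_minutes=2):
    """Sum only the outages longer than the SLA's minimum duration."""
    return sum(m for m in outages_minutes if m > minimum_minutes)


for pct in (99.9, 99.5):
    print(f"{pct}% uptime allows about {allowed_downtime_hours(pct):.2f} hrs/month")

# Hypothetical outage log, in minutes. The two short blips fall under the
# 2-minute minimum, so they would not count against this SLA definition.
outages = [0.5, 1.5, 3, 12]
print(counted_downtime_minutes(outages), "minutes count against the SLA")
```

The result depends on the month length you assume: 99.9% works out to about 0.72 hours for a 30-day month and about 0.74 hours for a 31-day month, so small differences between published figures are not necessarily errors. The second function shows why two vendors with the same percentage can report different results for the same incidents.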
2) Prove it: Talk live with customers
As with any vendor-selection process, getting customer references is a key part of making a decision. Talking live with existing customers is critical; published customer stories are not a substitute. If you have any hesitations about the vendor, speaking directly with a customer reference is a good way to either confirm or dismiss them.
3) The bake-off
After you’ve completed a paper evaluation, the next step is to do a bake-off: Most vendors offer a free or paid pilot where you can test their solution’s capabilities and make sure they will meet your needs. Make sure you go into the pilot with a clear set of success criteria and a plan for how you will put the product through its paces.
Selecting the right Prometheus solution
As you go through this process, you may conclude that a pure managed Prometheus offering doesn’t actually meet your criteria, and that what you need instead is a Prometheus-native SaaS solution. Even though they are SaaS offerings, several of the solutions on the market are not complete end-to-end offerings. For example, in some cases, you are still responsible for running your own instance of Grafana for dashboarding and visualization and Prometheus Alertmanager for alerts. Other solutions also force you to continue running Prometheus collection instances in your own environment.
This additional management overhead can ultimately prove very time-consuming and expensive, and it can lead to other significant challenges such as single points of failure and lower availability and reliability. That’s why, in many cases, it makes more sense to work with a SaaS solution that is Prometheus-native (i.e., one that has full compatibility with Prometheus, as described earlier, without some of the limitations of pure managed Prometheus) instead of a pure managed Prometheus offering. That eliminates the need for additional tooling you run yourself.
Why choose Chronosphere
Chronosphere is the only SaaS monitoring solution built for cloud-native, providing deep insights into every layer of your stack – from the infrastructure to the applications to the business. Chronosphere is open-source compliant, and fully supports Prometheus metrics ingest protocols, dashboards, and query languages.
With Chronosphere, teams can avoid lock-in and also leverage their existing Prometheus and Grafana investments. Additionally, Chronosphere supports older generations of metrics protocols (Graphite, StatsD, etc.), which means it will support your entire environment, even as you migrate off older formats.
Other blogs in this series: