Building out an alert isn’t always a “set and forget” task. Sometimes requirements change over time and need to be revisited to add new information or incorporate specific schedule requirements. This might include tweaking a few parameters – or delving deep into PromQL code to get everything just right.
Chronosphere’s engineers deploy hundreds of services running on Google Kubernetes Engine and ensure that these services operate smoothly when end users log into their Chronosphere tenant portal.
We regularly use an alert that tells us if different service versions are running in production (also known as version drift); having multiple service versions increases troubleshooting time, decreases testing accuracy, and causes an inconsistent user experience.
However, we recently moved to a twice-a-week deployment schedule for most of our tenants. This meant that from Tuesday to Thursday, we expected all services to run identical versions. From Thursday to the following Tuesday, we expected all services to run identical versions except for one tenant.
With this schedule, we started to receive several unactionable alerts from Thursday to Tuesday. It quickly became an alert noise issue and we discussed a change in the alert parameters.
I researched some possible solutions and realized I could use a PromQL query to absolutely ensure the alert would fire when we wanted and incorporate the specific deployment timing requirements.
Building the PromQL query foundation
By combining the actual query with a time restriction – via PromQL and the Prom Operator – I was able to focus on the actual hours when we considered version drift to be a problem. This approach would ensure that the team would only get pinged with an accurate alert that was actionable.
Before I started, I realized there were a few potential challenges with this query:
- Understanding what data we were collecting that could be used to inform us of the conditions we are interested in (in our case, version drift). I wanted to make sure I included all the necessary information so technical staff knew what was happening when an alert fired.
- Understanding how set operators work, and understanding how Prometheus functions work in order to effectively build the PromQL query. The more familiar I was with what functions I could use and the data they returned, the easier it would be to build the query.
- Having the timestamps stored as Coordinated Universal Time (UTC) made the query even more complicated, because the typical North American workday spans multiple days in UTC, which requires more conditions to include all the necessary time zones.
However, I knew I’d be able to get to my desired query by working on each individual concern, with lots of testing along the way.
To validate my theory and understanding of the PromQL functions, I started with a very simple case: timestamp(up{}), which I expected to give me a Unix timestamp back (which it did). This is when I realized that the timestamps were returned in UTC. These timestamps would help me build out the time parameters for the overall query.
Next, I tested the hour and day_of_week functions to check that they gave me the expected values back: hour(timestamp(up{})).
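For reference, the checks at this stage look roughly like the queries below, each meant to be run on its own. The argument-less hour() and day_of_week() forms are not part of the original walkthrough; they rely on the documented default argument of vector(time()), which uses the evaluation time rather than a series' sample time.

```promql
# Run each query on its own.

# Hour of the day in UTC for each up series: a value from 0 to 23.
hour(timestamp(up{}))

# Day of the week in UTC for each up series:
# 0 is Sunday, 2 is Tuesday, 6 is Saturday.
day_of_week(timestamp(up{}))

# With no argument, both functions default to vector(time()), the
# evaluation time, and return a single sample with no labels.
hour()
day_of_week()
```

Knowing that Sunday is 0 matters later, because the day comparisons in the alert query are written against these numbers.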
After that, I added in the set operators, e.g. up{} and (hour(timestamp(up{})) == 1), which I expected to give a result only between 1:00am and 1:59am UTC daily. I then substituted my real version drift query for the up{} metric and got to work on translating our schedule into UTC times.
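As an illustration of that translation step, here is a sketch of one local window rewritten in UTC and used to gate up{}. The window itself (Tuesday 3pm to midnight Pacific Standard Time, UTC-8) is an assumed example. The and on() form with argument-less hour() and day_of_week(), which evaluate against the query time, is an alternative to the hour(timestamp(...)) pattern and not the query from the original work.

```promql
# Tuesday 15:00 - 24:00 PST (UTC-8) is Tuesday 23:00 - Wednesday 08:00 UTC,
# so one local window needs two UTC conditions joined with "or".
up{}
and on() (
    (day_of_week() == 2 and hour() >= 23)  # Tuesday 23:00 - 23:59 UTC
  or
    (day_of_week() == 3 and hour() < 8)    # Wednesday 00:00 - 07:59 UTC
)
```

The on() modifier tells the and operator to ignore all labels when matching, which lets the label-less time condition gate every up series at once.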
The next step was to make sure my PromQL code would work in production and get a final product that would only fire alerts when version drift was actually an issue.
Peeling back the layers (and debugging)
In my experience, debugging PromQL is always hard, because when I do something wrong the usual answer is nothing: I just get no value back. To debug this query, I did what I always do, which is to peel off each layer of complexity until the query is simplified down to something that I am confident should work.
I also used Chronosphere’s Query Builder and PromLens as part of this process; they break down each part of the query and help me understand at which part my metrics disappear.
Initially, I started with this larger version of the query to see if I got any results back:
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) == 2 and hour(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) >= 23)
And if that still gave me no/unexpected results, I’d strip it down further by testing each part of the and condition separately:
day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) == 2
hour(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) >= 23
And if that test still didn’t work, I’d go further again and break it down to:
timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)
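One reason a query like this can return nothing is that the > 1 filter drops every service that is not currently drifting, which leaves timestamp() with no input. Here is a sketch of how the layers can be checked from the inside out, one query at a time. Relaxing the threshold to >= 1 is a temporary testing aid assumed here, not part of the final alert.

```promql
# Run each query on its own.

# 1. Does the raw metric exist for the production clusters?
build_information{chronosphere_k8s_cluster=~"production-.*"}

# 2. How many distinct build versions does each service report?
#    A value of 1 means no drift for that service.
count by (service) (
  count by (build_version, service) (
    build_information{chronosphere_k8s_cluster=~"production-.*"}
  )
)

# 3. Temporarily relax "> 1" to ">= 1" so the time functions always
#    have input, then confirm the day value looks right.
day_of_week(timestamp(count by (service) (
  count by (build_version, service) (
    build_information{chronosphere_k8s_cluster=~"production-.*"}
  )
) >= 1))
```

Once step 3 returns the expected day number, the threshold can go back to > 1 and the day and hour comparisons can be added again.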
After multiple rounds of testing and peeling back these code layers, I ended up with an extensive query that would cut down on false alerts if multiple versions were running on our production clusters:
# Tuesday 3pm - 4pm
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) == 2 and hour(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) >= 23)
# Tuesday 4pm - Wednesday 4pm
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) == 3)
# Wednesday 4pm - Thursday 8am
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) == 4 and hour(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*"
}
)
) > 1)) < 16)
# Thursday to Tuesday, only the canary should be different
# Thursday 3pm - Thursday 4pm
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) == 4 and hour(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) >= 23)
# Thursday 4pm - Friday 4pm
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) == 5)
# Friday 4pm - Saturday 4pm
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) == 6)
# Saturday 4pm - Sunday 4pm
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) == 0)
# Sunday 4pm - Monday 8am
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) == 1 and hour(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) < 16)
# Monday 4pm - Tuesday 8am
or
(day_of_week(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) == 2 and hour(timestamp(count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!="canary"
}
)
) > 1)) < 16)
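Because every branch repeats the same drift expression twice, the query is long and easy to get wrong in one place. A shorter shape is possible, sketched here for the first three windows only. It assumes that argument-less day_of_week() and hour(), which use the evaluation time, are an acceptable stand-in for the timestamp(...) calls; it is not the query that ran in production.

```promql
# Drift check, written once.
(
  count by (service) (
    count by (build_version, service) (
      build_information{chronosphere_k8s_cluster=~"production-.*"}
    )
  ) > 1
)
# Keep the result only while the evaluation time is inside a window (UTC).
and on() (
    (day_of_week() == 2 and hour() >= 23)  # Tuesday 3pm - 4pm
  or
    (day_of_week() == 3)                   # Tuesday 4pm - Wednesday 4pm
  or
    (day_of_week() == 4 and hour() < 16)   # Wednesday 4pm - Thursday 8am
)
```

The Thursday to Tuesday windows would need a second block of the same shape, using the selector that excludes the canary namespace and joined to this one with or.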
Applying project learnings
With the finished query in place, I could notify the engineering team only when version drift was actually considered a problem. We received only “true” alerts within the defined schedule, which meant fewer unactionable alerts to sift through.
Shortly after building out this PromQL query, my team and I realized it would become too complicated to adjust the timeframes for daylight saving time. This initial exercise still proved useful, however, as I could use the components of the query to build out a robust scheduled alert that addressed the same issue and would be easier to maintain going forward.
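To make the problem concrete: hour() always works in UTC, so the UTC hour that corresponds to a local deployment time moves when the clocks change. Below is a small sketch using the Tuesday 3pm boundary from the long query and the standard Pacific offsets (UTC-8 in winter, UTC-7 in summer). It calls hour() and day_of_week() without arguments, which evaluates them against the query time.

```promql
# Run one of these at a time, depending on the season.

# Pacific Standard Time (UTC-8): Tuesday 15:00 local is 23:00 UTC.
day_of_week() == 2 and hour() >= 23

# Pacific Daylight Time (UTC-7): Tuesday 15:00 local is 22:00 UTC.
day_of_week() == 2 and hour() >= 22
```

Every hour boundary in the long query would need the same seasonal adjustment, and PromQL's time functions have no notion of a named time zone that could apply it automatically.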
Because Chronosphere natively supports scheduled alerts, I rewrote the alert using that feature and ended up with a small set of much more succinct Terraform resources:
resource "chronosphere_monitor" "infra" {
name = "Different build version of services running in production"
query {
prometheus_expr = <<-EOT
count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
}
)
)
EOT
}
schedule {
timezone = "PST"
# Begin evaluating after Tuesday's deployment should be done
range {
day = "Tuesday"
start = "15:00"
end = "24:00"
}
# Stop evaluating when Thursday's deployment will start
range {
day = "Wednesday"
start = "00:00"
end = "24:00"
}
range {
day = "Thursday"
start = "00:00"
end = "08:00"
}
}
# Series conditions, labels, groupings, etc.
}
resource "chronosphere_monitor" "infra_excluding_canary" {
name = "Different build version of services running in production, excluding canary"
query {
prometheus_expr = <<-EOT
count by (service) (
count by (build_version, service) (
build_information{
chronosphere_k8s_cluster=~"production-.*",
chronosphere_k8s_namespace!~"canary"
}
)
)
EOT
}
schedule {
timezone = "PST"
# Begin evaluating after the Thursday deployment should be done
range {
day = "Thursday"
start = "15:00"
end = "24:00"
}
# Stop evaluating when Tuesday's deployment will start
range {
day = "Friday"
start = "00:00"
end = "24:00"
}
range {
day = "Saturday"
start = "00:00"
end = "24:00"
}
range {
day = "Sunday"
start = "00:00"
end = "24:00"
}
range {
day = "Monday"
start = "00:00"
end = "24:00"
}
range {
day = "Tuesday"
start = "00:00"
end = "08:00"
}
}
# Series conditions, labels, groupings, etc.
}
Additional resources
Interested in learning more about Prometheus, PromQL, and Chronosphere? Check out these resources: