Why events are the critical telemetry type you’re missing


Familiar with logs, metrics, and traces but new to events? This blog explores why events are so critical to observability and how they can help.

Rachel Dines, Head of Product & Solution Marketing | Chronosphere

Rachel leads Product & Solution Marketing for Chronosphere. Previously, she built out product, technical, and channel marketing at CloudHealth (acquired by VMware). Prior to that she led product marketing for AWS and cloud-integrated storage at NetApp and also spent time as an analyst at Forrester Research covering resiliency, backup, and cloud. Outside of work, she tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston.

5 MINS READ

Give events a chance; knowing what changed is essential to identifying and resolving problems.

In a meeting last year with a bunch of senior observability leaders from cloud native companies, I asked everyone to tell me their least favorite telemetry type: metrics, events, logs, traces, or whatever. I was pretty confident the dominant answer would be logs. Nothing against logs, but I had recently heard this group express the hot take “during an incident, if you’ve gone to the logs, you’ve already failed.”

I was wrong. To my surprise, they answered almost unanimously: events. Events were the most despised telemetry type. I followed up by asking, why do you dislike events so much? Again the answer was nearly unanimous: The lack of definition about what they are and how you can use them. 

I get it. In researching events, I’ve found four or five different definitions, and no one seems to have nailed down the best way to use them in a troubleshooting workflow.

Since that meeting, our team has spent a lot of time thinking about events and how we can make them useful as a first-class telemetry citizen. The team did extensive research and then got to work building a function to track change events. Just recently, we announced the ability to ingest events in our observability platform.

I want to step back and explore why events are so critical and how they can help.

Events tell you what change caused an issue

Change is the leading cause of errors. In a steady state, a system should continue to operate consistently for an indefinite period of time. Unfortunately, in a modern DevOps environment, our systems change dozens of times a day. We ship new code, we turn feature flags on and off, we deploy new infrastructure, we scale it up and down, and we even change observability solutions. And business doesn’t stand still either; it’s in constant flux based on the time of day, day of the week, season of the year, world events, competition, and a million other factors we can’t track.

The only way to stay on top of change is to contextually link your systems, so when you get an alert, you can quickly see what occurred in the same time frame that might have introduced the breaking change. Each of those discrete changes is what we call an event.
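To make "what occurred in the same time frame" concrete, here is a minimal Python sketch of that lookup. It is an illustration only, not how any particular observability platform implements it: the `ChangeEvent` record, the `events_near_alert` helper, the 30-minute lookback, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """A hypothetical record of one discrete change (not a vendor schema)."""
    timestamp: datetime
    source: str        # e.g. "ci-cd", "feature-flags", "autoscaler"
    description: str


def events_near_alert(events, alert_time, lookback=timedelta(minutes=30)):
    """Return events inside [alert_time - lookback, alert_time], newest first."""
    window_start = alert_time - lookback
    in_window = [e for e in events if window_start <= e.timestamp <= alert_time]
    return sorted(in_window, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data: an alert fires at 14:30 on a day with three changes.
alert_time = datetime(2024, 5, 1, 14, 30)
events = [
    ChangeEvent(datetime(2024, 5, 1, 9, 0), "ci-cd", "deployed checkout v1.4.1"),
    ChangeEvent(datetime(2024, 5, 1, 14, 12), "feature-flags", "enabled new-cart-ui"),
    ChangeEvent(datetime(2024, 5, 1, 14, 25), "autoscaler", "scaled checkout from 6 to 4 pods"),
]

for event in events_near_alert(events, alert_time):
    print(event.timestamp.isoformat(), event.source, event.description)
```

With this sample data, the morning deploy falls outside the window, so the responder sees only the two changes closest to the alert. The newest-first ordering is a design choice: the most recent change is usually the first suspect.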

 


What is an event anyway?

An event is a discrete change to a system, a workload, or an observability platform. Here are some examples of events and how they might help you troubleshoot an issue:

  • System change: These are the types of changes that most people think about when it comes to events. Examples might be an autoscaling action, a configuration change, or a feature flag being toggled. These changes can be found by digging into the relevant CI/CD, feature flag, or infrastructure management tools, but that takes precious time.
  • Workload change: This is the most common blind spot for organizations. Examples might be onboarding a new customer or a business event like a site-wide sale. Contextualizing your other telemetry data with these events can save the time otherwise lost to unnecessary investigation and Slack chatter when folks are trying to determine why their telemetry suddenly looks different even though there were no relevant system changes.
  • Observability platform change: These events could be an alert firing or being muted. It could also be a new data aggregation rule taking effect that causes the shape of the data to change.
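One way to picture the three kinds of change is as a category field on every event record. The Python sketch below is an assumption made for illustration, not a published schema: the `ChangeCategory` enum, the `Event` fields, the `group_by_category` helper, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ChangeCategory(Enum):
    """The three kinds of change described above (names are illustrative)."""
    SYSTEM = "system"
    WORKLOAD = "workload"
    OBSERVABILITY_PLATFORM = "observability_platform"


@dataclass(frozen=True)
class Event:
    """A hypothetical event record: one discrete change at one point in time."""
    timestamp: datetime
    category: ChangeCategory
    summary: str


def group_by_category(events):
    """Map every category to the summaries of its events, in input order."""
    grouped = {category: [] for category in ChangeCategory}
    for event in events:
        grouped[event.category].append(event.summary)
    return grouped


# Made-up examples, one or more per category.
examples = [
    Event(datetime(2024, 5, 1, 14, 12), ChangeCategory.SYSTEM, "feature flag new-cart-ui enabled"),
    Event(datetime(2024, 5, 1, 14, 0), ChangeCategory.WORKLOAD, "site-wide sale started"),
    Event(datetime(2024, 5, 1, 14, 20), ChangeCategory.OBSERVABILITY_PLATFORM, "alert muted: checkout latency"),
    Event(datetime(2024, 5, 1, 14, 25), ChangeCategory.SYSTEM, "autoscaling action on checkout"),
]

for category, summaries in group_by_category(examples).items():
    print(category.value, summaries)
```

Tagging events this way lets a responder rule out whole classes of change quickly, for example confirming that a traffic spike lines up with a workload change rather than a system change.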

How do events fit in with other telemetry types?

Like any observability signal, events cannot stand alone. They play an important role in the troubleshooting workflow alongside metrics, traces, and logs. While metrics can tell you the symptom of a problem and are the primary driver in mean time to detect (MTTD) results, events can quickly tell you what changed. Alongside tracing, which will help you find the location of the problem, events help you remediate and stop the customer pain. From there, you might dig into the logs to start understanding why the problem happened, so you can get to the root cause and fix the underlying issue.

We call this workflow the three phases of observability: Know about an issue, triage it, then understand it, all while working towards remediation as quickly as possible.

 


See events in action

I originally called this piece “in defense of events,” and I hope you now understand why and are open to giving them a chance. Events complement and enhance your other telemetry types, making it faster to get critical context into your alerts.

 
