You might have heard the discussions lately about the “three phases of observability.” But what do they really mean? Taking a page out of the “explain it like I’m five” book, I’ll break down the three phases of observability into an explanation anyone can understand, including a five-year-old.
Your house is on fire
Imagine your house is on fire. Yes, ok, this is scary, but stay with me. Obviously, your primary goal is to put the fire out as quickly as possible. The order of events typically goes like this:
- Know: First of all, you need to know there is a fire! Maybe the fire alarm is going off, maybe you smell smoke, maybe you see the flames. Something alerts you to the fact that something is not right. The faster that happens, the sooner you can take action.
- Triage: The next step is figuring out how bad this situation is. Is it a big fire or a small one? How many rooms is it in? Can you put the fire out yourself or do you need to call the fire department?
By this point you’re likely taking steps to put the fire out, or remediate it: you’ve called the fire department, you’re spraying the fire extinguisher, or you’re pouring baking soda on the flames.
- Understand: After the fire is out and everyone is safe, you can take a breath, assess the damage, and try to understand the root cause. How did this happen? Was your five-year-old trying to cook something? How can we make sure this doesn’t happen again? These are all critical questions, but they are best answered after you’ve remediated the problem. While there is a fire burning in my kitchen, the last thing I’m going to do is lecture my kid on cooking safety. That will come later.
In short, the goal is to get from the fire starting to putting out the fire as quickly as possible, while also running through the steps of know > triage > understand.
Bringing the analogy to your applications and infrastructure
Although the personal stakes are much lower, this is the same process an on-call engineer goes through when something goes wrong with their app:
- First, they need to know about it as quickly as possible. They need to know what happened, when it happened, and which systems were impacted. Ideally they find out BEFORE a customer is impacted.
- Then they need to triage and understand the impact and who can help. They need to figure out how many customers are impacted and to what degree. From there they can determine which teams need to get involved and what priority level it is.
- Lastly, they need to understand the root cause of the problem. They need to understand how many different services were impacted and which services were NOT impacted. They’ll want to find out what steps they can take to make sure it doesn’t happen again.
During this entire process, they are looking for a way to remediate the problem as fast as possible (i.e., put out the fire). Ideally, the last stage, understand, happens after the fire is out. If they can go straight from knowing about a problem to remediating it, even better: that takes the pressure off the next steps of triage and understanding. Often the issue can only be remediated after the triage stage, and in rare cases remediation happens during or after the understand phase.
Wait, what about metrics, logs, and traces?
When discussing observability, many people immediately jump to metrics, logs, and traces. These are still incredibly important: they are the data inputs used throughout the phases. The three phases focus on outcomes and processes rather than on tools and data sets.
Think of metrics, logs, and traces as powering your smoke alarm, your fire extinguisher, or the emergency phone call. They are a means to an end, but not the end in and of themselves.
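To make the smoke alarm idea a little more concrete, here is a minimal sketch of how a metric could power the "know" phase. Everything in it is made up for illustration: the `Window` shape, the `should_alert` helper, the 5% threshold, and the three-window rule are assumptions, not how any particular tool (Chronosphere included) works.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one time window (hypothetical shape)."""
    total: int
    errors: int


def error_rate(w: Window) -> float:
    # Treat an empty window as healthy rather than dividing by zero.
    return w.errors / w.total if w.total else 0.0


def should_alert(windows, threshold=0.05, consecutive=3):
    """Return True only if the last `consecutive` windows all exceed the threshold.

    Requiring several bad windows in a row is an assumed design choice: it keeps
    one noisy data point from setting off the alarm.
    """
    if len(windows) < consecutive:
        return False
    recent = windows[-consecutive:]
    return all(error_rate(w) > threshold for w in recent)


healthy = [Window(1000, 2), Window(1000, 3), Window(1000, 1)]
burning = [Window(1000, 2), Window(1000, 80), Window(1000, 120), Window(1000, 95)]

print(should_alert(healthy))  # no fire
print(should_alert(burning))  # the alarm goes off
```

The metric here is only the input. The outcome that matters is that someone finds out quickly that something is wrong.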
Explain Chronosphere like I’m five
Chronosphere is a SaaS cloud monitoring tool that helps teams rapidly navigate the three phases of observability. We’re focused on giving DevOps teams, SREs, and central observability teams the tools they need to know about problems sooner, the context to triage them faster, and the insights to get to the root cause more efficiently. One of the major things that makes us different is that we were built from the ground up to support the scale and speed of cloud-native systems. We also embrace open source standards, so developers can use the languages and formats they’re already familiar with and avoid lock-in.
Learn more about Chronosphere and the three phases of observability here.