As observability evolves, more companies want to know how it can benefit their cloud native infrastructure — but many still have questions about how observability differs from legacy monitoring, how it is priced, and how to implement it.
Chronosphere Field CTO Ian Smith joined The Duckbill Group’s Chief Cloud Economist, Corey Quinn, on an episode of his Screaming in the Cloud podcast, “The Ever-Changing World of Cloud Native Observability.” The two discussed the three pillars of observability, difficulties surrounding instrumentation, and Chronosphere’s innovations in the space.
For those who don’t have time to listen to the full episode, we’ve laid out much of their conversation below:
Getting level-set on observability basics
Corey: DoorDash had a problem. As their cloud native environment scaled and developers delivered new features, their monitoring system kept breaking down. In an organization where data is used to make better decisions about technology, and about the business, losing observability means the entire company loses their competitive edge.
With Chronosphere, DoorDash is no longer losing visibility into their application suite. The key? Chronosphere is an open source-compatible, scalable, and reliable observability solution that gives the observability lead at DoorDash business confidence, and peace of mind.
Corey: Every once in a while, I find that something I’m working on aligns perfectly with a person that I wind up basically convincing to appear on this show. Today’s promoted guest is Ian Smith, who is Field CTO at Chronosphere.
Ian: Thanks, Corey; great to be here.
Looking out for observability solution imposters
Corey: Observability is one of those areas that I think is suffering from too many definitions. At first, I couldn’t make sense of what people actually meant when they said observability. This sort of clarified to me, at least when I realized that there were an awful lot of, well, let’s call them legacy monitoring companies, that just chose to take what they were already doing, and define that as observability.
To my understanding, [Ian] you were at interesting places such as Lightstep, New Relic, Wavefront, and PagerDuty; which I guess technically might count as observability in a very strange way. How do you view observability?
Ian: A lot of definitions, as you’ve said, the common ones, talk about the three pillars. And they talk about data types. For me, it’s about outcomes. I think observability is really this transition from the yesteryear of monitoring, where things were much simpler and you sort of knew all of the questions. You were able to define your dashboards, you were able to define your alerts, and that was really the gist of it. Now we’re going into a brave new world where there are a lot of unknown things, and you’re having to ask a lot of unique questions, particularly during an incident. Being able to ask those questions in an ad hoc fashion isn’t something you could traditionally do with monitoring. Observability is built for that more flexible, more dynamic kind of environment that you have to deal with.
Corey: Back when I was running production environments, things tended to be a lot more static, where “There’s a problem with the database”. I will [secure shell] into the database server. Or, “We’re having a weird problem with the web tier. Well, there are 10 or 20 or 200 web servers… great, I could aggregate all of their logs to syslog, but worst case, I can log in and poke around.”
With a more ephemeral style of environment where you have Kubernetes, scheduling containers into place that have problems, you can’t attach to a running container very easily, and by the time you see an error, that container hasn’t existed for three hours. That becomes a problem.
Ian: Yeah, I think there’s that, and also the added complexity you oftentimes see of performance or behavioral changes along even narrower pathways, right? One particular user is having a problem, and because their traffic is spread across many containers, it isn’t [going to make] all of those containers perform badly. But their user experience is being affected.
More (data) is not always better: Cost implications and the three pillars
Ian: It’s very common in, say, B2B scenarios for you to want to understand the experience of one particular user, or the aggregate experience of users at a particular company, a particular customer, for example. There’s just more complexity: more complexity in the infrastructure and the technical layer we’re talking about, but also more complexity in the way we’re handling use cases and trying to provide value with all of this software to the myriad of customers and different industries that software now serves.
Corey: From where I sit, I tend to have a little bit of trouble disambiguating the three baseline data types that I see talked about again and again in observability. You have logs, which I mostly can wrap my head around. That seems to be the baseline story of “Oh, great, your application puts out logs. Of course it’s in its own unique beautiful format. Why wouldn’t it be?” In an ideal scenario, they’re structured. You’re basically tailing log files in some cases. I can reason about those.
Metrics always seem to be a little bit of a step beyond that. It’s “I have a whole bunch of log lines that are spitting out every 500 error that my app is throwing,” and given my terrible code, it throws a lot, but I can then ideally count the number of times that appears and then that winds up incrementing counters similar to the way that we used to see with StatsD for example. Is that directionally correct as far as the way I reason about logs and metrics?
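Corey’s counting intuition can be sketched in a few lines. This is an illustrative example only — the log format and function name are invented, not anything discussed on the podcast — but it shows the core move: collapsing raw log lines into StatsD-style counters keyed by status code.

```python
from collections import Counter

def count_status_codes(log_lines):
    """Aggregate raw log lines into per-status-code counters,
    the way a metrics pipeline collapses logs into numbers."""
    counts = Counter()
    for line in log_lines:
        # Hypothetical structured format: "<timestamp> <status> <path>"
        _, status, _ = line.split(" ", 2)
        counts[status] += 1
    return counts

logs = [
    "2024-01-01T00:00:00Z 500 /checkout",
    "2024-01-01T00:00:01Z 200 /health",
    "2024-01-01T00:00:02Z 500 /checkout",
]
counts = count_status_codes(logs)  # counts["500"] == 2, counts["200"] == 1
```

Instead of reading every log line, a human (or an alert rule) now reasons about two numbers.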
Ian: I think at a really basic level, yes. I think that as we’ve been talking about how greater complexity starts coming in when you have metrics in today’s world of containers, Prometheus has become the standard for expressing those things. You get situations where you have incredibly high cardinality — cardinality being the interplay between all the different dimensions.
So [let’s say] my container is a label, but also the type of endpoint that is running on that container is a label. Then, maybe I want to track my customer organizations and have 5,000 of those. I have 3,000 containers. And you get this massive explosion, almost multiplicatively. For those in the audience who really live and breathe cardinality, there is probably someone screaming “Well, it’s not truly multiplicative in every sense of the word,” but you know, it’s close enough from an approximation standpoint.
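To make Ian’s back-of-envelope math concrete: using his numbers (3,000 containers, 5,000 customer organizations) plus an assumed count of 10 endpoints, the worst-case series count is the product of the distinct values of each label. As Ian notes, this is an upper bound, since not every label combination actually occurs in practice.

```python
# Worst-case time series count for ONE metric is the product of the
# number of distinct values per label. The endpoint count is assumed
# for illustration; the container and customer counts are from Ian's example.
label_values = {
    "container": 3_000,
    "customer_org": 5_000,
    "endpoint": 10,  # hypothetical
}

worst_case_series = 1
for label, n in label_values.items():
    worst_case_series *= n

print(f"{worst_case_series:,} potential series")  # 150,000,000 potential series
```

Three innocuous-looking labels are enough to put a single metric into the hundreds of millions of potential series.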
You [then] get this massive explosion of data, which obviously has a cost implication, but also has a really big implication for the core reason you have metrics in the first place, which is so that a human being can reason about them. You don’t want to go and look at 5,000 log lines. You want to know: “Out of those 5,000 log lines, I have 4,000 errors and I have 1,000 OKs.” It’s very easy for a human being to reason about that from a numbers perspective. When your metrics start to explode out into thousands, even millions, of data points and unique time series for you to track, you’re losing that original goal of metrics.
Corey: Then that brings us to traces. The difference between a trace and logs tends to get very muddled for me. But the idea is that as you have a customer session, or a request that talks to different microservices, how do you collate, across different systems, all of the outputs of that request into a single place so you can see timing information and understand the flow that the user took through your application? Is that directionally correct? Have I completely missed the plot here?
Ian: No, that is sort of the fundamental premise or expected value of tracing. We have something that’s akin to a set of logs. They have a common identifier — a trace ID — that tells us that all of these logs essentially belong to the same request, but importantly there’s relationship information.
This is the difference between that and just having logs with a trace ID attached to them. For example, if you have service A calling service B and service C, the relatively simple thing you could do is use time to try to figure this out. But what if there are things happening in service B at the same time, or happening in service C and D, and so on? So, one of the things that tracing brings to the table is that it tells you what is currently happening and what called it: “Oh, I know that I’m in service D, but I was actually called by service B, and I’m not just relying on timestamps to try and figure out that connection.”
You have that information and ultimately, the data model that allows you to fully reflect on what’s happening with the request, particularly in complex environments. I think this is where tracing … needs to be used in a scenario where you really have trouble grasping from a conceptual standpoint what is happening with the request because you need to fully document it.
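Ian’s parent/child point can be sketched in a few lines. The span fields and service names below are invented for illustration (real tracing systems carry more, such as a trace ID and timestamps), but they show why recorded parent IDs, rather than timestamp guesswork, are what let you rebuild the call graph:

```python
# Each span records which span called it (parent_id), so the call graph
# can be rebuilt directly instead of being inferred from timestamps.
spans = [
    {"id": "a", "parent_id": None, "service": "A"},  # root of the request
    {"id": "b", "parent_id": "a",  "service": "B"},
    {"id": "c", "parent_id": "a",  "service": "C"},
    {"id": "d", "parent_id": "b",  "service": "D"},  # D knows B called it
]

def children(parent_id):
    """Which services were called directly by the given span?"""
    return [s["service"] for s in spans if s["parent_id"] == parent_id]

print(children("a"))  # ['B', 'C']  — A fanned out to B and C
print(children("b"))  # ['D']       — D was called by B, not by A or C
```

Even if B and C ran concurrently, the structure is unambiguous, which is exactly what timestamps alone cannot guarantee.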
Continue this part of the conversation at 10:00
Instrumenting with OpenTelemetry
Corey: I had a passing fancy where I built this online, freely available Twitter client for authoring Twitter threads. I’ve used that as a test for a few things, and it is now deployed to roughly 20 AWS regions simultaneously. I’d kind of like to log and see who’s using it.
I figured that this is a perfect little toy application. It runs in a single Lambda function, so it’s not that complicated. I could instrument this with OpenTelemetry, which, at least according to the instructions, I could then send different types of data to different observability tools without having to re-instrument this thing every time I want to kick the tires on something else. That was the promise.
This led to three weeks of pain — because it appears that, for all of the promise that it has, OpenTelemetry, particularly in a Lambda environment, is nowhere near ready for being able to carry a workload like this. Am I just foolish on this? Am I stating an unfortunate reality that you’ve noticed in the OpenTelemetry space? Or is OpenTelemetry the wrong approach?
Ian: I think that OpenTelemetry is absolutely the right approach. To me, the promise of OpenTelemetry for the individual is: “Hey, I can go and instrument this thing, as you said, and I can go and send the data wherever I want.” The larger view of that is: “Well, I’m no longer beholden to a vendor, including the ones that I’ve worked for, including the one that I work for now, for the definition of the data. I am able to control that, I’m able to choose that, I’m able to enhance that, and any effort I put into it is mine. I own that.”
Whereas previously, if you picked, for example, an APM vendor and said: “Oh, I want to have some additional information provided. I want to track my customer, or I want to track a particular new metric of how many dollars I am transacting,” that effort is really going to support the value of that individual solution. It’s not going to support your outcomes, which is: “I want to be able to use this data wherever it’s most valuable.”
Ian: The core premise of OpenTelemetry is great, but I think it’s a massive undertaking to be able to do this for at least three different data types: defining an API across a whole bunch of different languages and across three different data types, and then creating implementations for those, because the implementations are the thing that people want, right? You were hoping for the ability to drop in something, maybe one line of code, or preferably just to attach a dependency in Java, and at runtime have the information flow through, complete. This is the premise of vendors I’ve worked with in the past. Having that out-of-the-box visibility is a goal of OpenTelemetry, wherever it makes sense.
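The separation Ian describes — instrument once, route the data by configuration — can be sketched without the real OpenTelemetry SDK. Everything below is a toy model with invented names (the actual OTel API is different and richer); the point it illustrates is that the instrumented call site never changes when you swap where the data goes:

```python
import time

# Toy stand-ins for pluggable backends. In OpenTelemetry these would be
# SDK exporter implementations selected by configuration, not code edits.
def console_exporter(span):
    print(f"[console] {span['name']} took {span['duration_s']:.3f}s")

collected = []

def in_memory_exporter(span):
    collected.append(span)

class Tracer:
    """Minimal tracer: times a block of work and hands it to an exporter."""
    def __init__(self, exporter):
        self.exporter = exporter  # swap backends here, not at call sites

    def span(self, name):
        tracer = self
        class _Span:
            def __enter__(self):
                self.start = time.monotonic()
                return self
            def __exit__(self, *exc):
                tracer.exporter({"name": name,
                                 "duration_s": time.monotonic() - self.start})
                return False  # never swallow exceptions
        return _Span()

# The instrumented code is identical regardless of destination.
tracer = Tracer(in_memory_exporter)  # or Tracer(console_exporter)
with tracer.span("handle_request"):
    pass  # application work would happen here
```

Changing vendors means changing the one `Tracer(...)` line (or, in real OTel, a config file), while every `with tracer.span(...)` in the codebase stays untouched — which is the ownership argument Ian is making.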
Continue this part of the conversation at 14:38.
We don’t need the multifunctional printer: The problem with too many capabilities
Corey: One thing that I was consistently annoyed by, in my days of running production infrastructure at places like large banks, for example, is the idea: “Instrument your applications with our libraries, or our instrumentation standards.” It felt like I was constantly doing and redoing a lot of instrumentation for different aspects. It’s not that we were replacing one vendor with another, it’s that in an observability tool chain, there are remarkably few one size fits all stories.
It feels increasingly like everyone’s trying to sell me a multifunction printer, which does one thing well, and a few other things just well enough to technically say they do them, but badly enough that I get irritated every single time. The light at the end of the tunnel for me, in what OpenTelemetry is promising, is: instrument once, and then you’re just adjusting configuration as far as where to send it.
Ian: Companies that have really invested heavily in OpenTelemetry over the last two years are definitely getting to the point now where they’re generating the data once, they’re using pieces of the OpenTelemetry pipeline, they’re extending it themselves, and then they’re able to send that data to a bunch of different places. Maybe they’re putting it in a data lake for business analysis or forecasting purposes, or maybe they’re putting the data into two different systems for incident analysis purposes. But they don’t have that duplication of effort, or the potential performance impact of running two different instrumentation packages alongside each other.
It’s all about quality (of vendor features) over quantity
Corey: There is a recurring theme that I’ve noticed in the observability space that annoys me to no end. There are so many startups that I have seen and worked with in varying aspects of the observability space, where I think “This is awesome. I love the thing that they do.”
Invariably, every time, they start getting more and more features bolted onto them. It feels like they keep bolting things on and bolting things on until everything is, more or less, trying to evolve into its own version of Datadog. What’s up with that?
Ian: Yeah, the sort of dreaded platform plague. I was at New Relic when there were essentially two products that they sold, and by the time I left, I think there were seven different products being sold, which is kind of a crazy thing when you think about it. I think Datadog has definitely exceeded that now. I do see many vendors in the market, and even open source solutions, presenting themselves as this integrated experience. But to your point, as in your experience at these banks, it oftentimes becomes a tick-a-box feature approach of, “I can do this thing, so buy more, and here’s a shared navigation panel.”
One of the things that I do in my role is that I get to work with our internal product teams very closely particularly around new initiatives like tracing functionality. And the constant sort of conversation is: “What is the outcome? What is the value?” It’s not about the feature, it’s not about having a list of 19 different features. It’s more about: “What is the user able to do with this?”
There are lots of platforms that have metrics, logs, and tracing. The new one-upmanship is saying: “We have events as well, and we have incident response, and we have security.” And all these things sort of tie together, so it’s one invoice.
I talk to customers and ask what outcomes they get when they’ve invested so heavily in one vendor. And oftentimes the response is, “I only need to deal with one vendor.” But that’s not an outcome. That’s the business having a single invoice.
Greater value and greater outcomes
Ian: I do think that people are hoping for that greater value and those greater outcomes. So, being able to actually provide differentiation in that market, I don’t think, is terribly difficult. There are still huge gaps in, let’s say, root cause analysis during investigation time. There are huge issues with vendors who don’t think beyond just the one individual who’s looking at a particular dashboard, or at whatever analysis tool there is. Getting those things actually tied together is not just about having metrics and logs and traces together: if you say, “We have metrics and tracing,” how do you move between metrics and tracing?
One of the goals in how we are developing product at Chronosphere is that if you are alerted to an incident, you as an engineer, whether you’re a lead architect who’s been with the company forever and knows everything, or someone who’s just come out of onboarding and it’s your first time on call, should not have to think, “Is this a tracing problem or a metrics problem or a logging problem?”
This is the thing I mentioned before: [the requirement of] a really heavy level of knowledge and understanding about the observability space, your data, and your architecture to be effective. Particularly with observability teams and all of the engineers that I speak with on a regular basis, you get this circumstance of “Let’s talk about a real outcome and a real pain point,” because people are like, “This is all fine, but it’s all coming from a vendor who has a particular agenda.”
But there’s a thing that constantly resonates with large organizations that are moving fast: big startups, unicorns, or even more traditional enterprises that are trying to undergo a rapid transformation, go really cloud native, and make sure their engineers are moving quickly.
A common question I ask them is, “Who are the three people in your organization who always get escalated to?” It’s usually between two and five people.
Spreading knowledge across your engineering team
Corey: At least anyone who’s worked in environments or through incidents like this more than a few times has already thought of specific people at specific companies. And they almost always fall into some very predictable archetypes, but please continue.
Ian: And people do think about these people; they always jump to mind. One of the things that I ask is: When you did your last innovation around observability, like introducing a new data type or making some big investment in improving instrumentation, what changed about their experience? And oftentimes the most that can come out is that they have access to more data. That’s not great. Are they still getting woken up at 3 a.m.? Are they constantly getting pinged all the time?
At one of the vendors I worked at, when they would go down, there were three engineers in the company who were capable of generating the list of customers actually impacted by an outage.
This is where I think the observability industry has sort of gotten stuck: “Can you do it? Yes. But is it effective? No.” By effective, I mean those three engineers become the focal point for an organization. It doesn’t matter whether you’re talking about a team of a hundred, you’re talking about a team of a thousand, it’s always the same number of people [that are called during an outage].
The vanity measures of “more is more”
Ian: And as you get bigger and bigger, it becomes more and more of a problem. So, does the tooling actually make a difference to them? And you might ask, “Well, what do you expect from the tooling? What do you expect it to do for them? Is it that you give them deeper analysis tools? Is it that you do AIOps?”
The answer is in how you take the capabilities those people have and spread them across a larger population of engineers. And that, I think, is one of those key outcomes of observability that no one, whether on the open source or the vendor side, is really paying a lot of attention to. It’s always about shoving more data in, and “we’ve got petabyte scale and we can deal with 2 billion active time series,” and all these other sort of vanity measures.
We’ve gotten really far away from the outcomes, like, “am I getting return on investment of my observability tooling?” And I think tracing can be difficult to reason about, right? People are not sure, and think “I’m in a microservice environment, I’m in cloud native, I need tracing because my older APM tools appear to be failing me. I’m just going to go and wiggle my way through implementing OpenTelemetry, which has a significant engineering cost. I don’t know what to expect, so I’m gonna go and throw my data somewhere and see whether we can achieve those outcomes. And I’m going to do a pilot and send my most sophisticated engineers to the pilot and then they’re able to solve the problems. Okay, I’m gonna go buy that thing.”
But I’ve just transferred my problems … and probably just cost myself a lot, both in terms of engineering time and raw dollar spend as well.
Taking it back to the cloud native observability fundamentals
Corey: One of the challenges that I’m seeing across the board is that with observability, once you start to see what it is and its potential for certain applications, it’s clear that there is definite and distinct value versus other ways of doing things. The problem is that the value often becomes apparent only after you’ve already done the work and can see what the other side looks like. How do you wind up viewing the return on investment of going ahead and instrumenting for observability in complex environments?
Ian: I think that you have to look at the fundamentals. Pretend that we had just invented logging and needed to start small. I’m not going to go and log everything about every application I’ve ever had. What I need to do is find the points where logging is going to be the most useful and most impactful, across the broadest audience possible. And one of the useful things about tracing is that, because it’s built primarily for distributed environments, you can look at, for example, the biggest intersection of requests.
For anyone who’s used Prometheus, or decided to move away from Prometheus: no one’s ever gone and evaluated a Prometheus replacement without having some sort of Prometheus data, right? You don’t go, “Hey, I’m going to evaluate a replacement for Prometheus or my StatsD without having any data, and I’m simultaneously going to generate my data and evaluate a solution at the same time.” It doesn’t make any sense. With tracing, there are decent open source projects out there that allow you to visualize individual traces and understand the basic value you should be getting out of this data. It’s a good starting point to ask: Can I reason about a single request? Can I look at my request end-to-end, even in a relatively small slice of my environment? Can I see the potential for this, and can I think about the things that I need to be able to solve?
Investing & instrumentation
Ian: The real problem is external dependencies (the Facebook API is the one that everyone loves to use): “I need to go instrument that.” And then you can start to get more clarity. Tracing has this interesting network effect: you can basically just follow the breadcrumbs and take that exploratory approach, rather than doing everything all upfront. But it is important to do something before you start trying to evaluate what your end state is, and where you want to be in two years’ time. Maybe it’s an open source solution, maybe it’s a vendor solution, maybe it’s one of those platform solutions you talked about. But how do I get there? It’s really going to be about taking the right approach and being very clear about the value and outcomes.
There’s no point in putting a whole bunch of instrumentation effort into things that are just working fine, right? You want to focus your time and attention, but you also don’t want to go and burn out individual engineers.
The observability team’s purpose in life is probably not to just write instrumentation, or just deploy OpenTelemetry — because then we get back into the land where engineers themselves know nothing about the monitoring or observability that they’re doing.
Balancing on return and ownership
Ian: A level of ownership supported by the observability team is really important. On that ROI thought, it’s not just the instrumentation effort. There’s product training and there are some very hard costs.
People often think, “Well, what I pay the vendor is really the only cost that I have.” But there are things like egress costs, particularly with large volumes of data. There are infrastructure costs: a lot of the time, there will be elements you need to run in your own environment, and those can be very costly as well. Ultimately, there are icebergs in this overall ROI conversation.
On the other side of ROI, there’s a lot of difficulty in reasoning about, “What is the value of this going to be if I go through all this effort?” Everyone knows the archetype of “three options, pick two,” because it’s always going to be a trade-off.
For observability, it’s become an element of “performance, data fidelity, or cost: pick two.” So, there are a lot of different things you need to balance on that return. And as you said, oftentimes you don’t get to understand the magnitude of those until you’ve got the full data set in and you’re trying to do this for real. Being prepared and iterative as you go through this effort, and not saying, “Okay, I’m just going to buy everything from one vendor because I assume that’s going to solve my problem,” is probably the undercurrent there.
Looking at SaaS solutions today
Corey: As I take a look across the entire ecosystem, I see that there’s a strong observability community out there that is absolutely aligned with the things I care about and things I want to do. And then there’s a bunch of SaaS vendors, where it seems that they are, in many cases, yes, advancing the state of the art, but I do think that when the tool you sell is a hammer, then every problem starts to look like a nail, or in my case, like my thumb. Do you think that there’s a chance that SaaS vendors are in some ways making this entire space worse?
Ian: As we’ve sort of gone into more cloud native scenarios, and people are building things specifically to take advantage of cloud from a complexity standpoint, from a scaling standpoint, you start to get vertical issues happening. So you have conversations like, “We’re gonna charge on a per container basis. We’re gonna charge on a per host basis. We’re gonna charge based on the amount of gigabytes that you send us.” These are sort of more horizontal pricing models. And the way the SaaS vendors have delivered this is by making it pretty opaque, right?
Everyone has experiences with, or jokes about, overages from observability vendors: massive spikes. I’ve worked with customers who have accidentally used some features and been billed a quarter million dollars on a monthly basis for accidental overages from a SaaS vendor. These are all terrible things, but we’ve gotten used to it; we’ve just accepted it, right? Because everyone is operating this way. And I really do believe that the move to SaaS encouraged this attitude of, “Well, you’re throwing us more data, and we’re charging you more for it as a vendor.”
Listen to the full episode to learn more about today’s observability space, and the importance of choosing the right solution. Continue at 34:19.