
Podcast: Slight Reliability – Dashboards and Modern Observability with Eric Schabell

In this episode of the Slight Reliability podcast, Stephen Townshend sits down with Chronosphere’s Eric Schabell to chat about how unmanaged observability can lead to high cloud bills, the world of dashboards and documentation, and Chronosphere’s know, triage, and understand workflow.

Transcript

Steven: Welcome to Slight Reliability – the show where we learn SRE and observability, one week at a time. I’m Stephen Townshend. Today on the show, I have Eric Schabell from Chronosphere. Eric is Chronosphere’s Director of Evangelism. He is renowned in the development community as a speaker, lecturer, author, and a baseball expert. His current role allows him to help the world understand the challenges they’re facing with cloud native observability. He brings a unique perspective to the stage with a professional life dedicated to sharing his deep expertise in open source technologies and organizations, and he is a CNCF Ambassador. So, welcome.

Eric: Thank you, nice to be here. This has been one of my sources for learning about the SRE world, and about observability over the last year … I spent probably about 25 years in the app dev space. So as, you know, starting off as a developer, coding Java, all that kind of stuff, through the years, which is not something you’d expect to find me on a site reliability podcast. About a year and a half ago, I decided to step into this role in this organization. It’s a new world for me. It’s been kind of fun because I’m also someone that’s used to documenting the experience and sharing what I do. I’m a teacher, basically, in this kind of a role. I’ve spent some time with a series called O11Y Guide, [which is] step for step my process: what I’m learning, what I’m seeing, what I’m doing, a lot like what you’re doing through this podcast. Mine’s just a little bit wider of a view because cloud native observability is a big topic.

Steven: What attracted you to observability? Was that something you were working on in that space before or was it a brand new field almost?

Eric: I noticed when you’re working in technical marketing or as a developer advocate, you start talking about a certain direction of technology, a strategy. You start developing insights and sharing and growing in that direction. It might take you left, it might take you right, depending upon your personal interests. I noticed that I was starting to observe cloud native development and cloud native environments, containers and Kubernetes and all the CNCF stuff. I was very much noticing that cloud data is an issue, right? Customers are getting hammered. It’s become mainstream now. But imagine about two years ago, people were pretty shocked and stunned that when you go into a marketplace on any of the major providers, and you click on something to use it as a developer, you’re spinning up stuff, you’re enjoying your stuff, and I’m building a demo and it’s all fantastic and syncing – with GitHub and all that stuff.

Then, I get the bill after a week of building a demo and it’s 800 bucks. And I’m like, why? And it turns out, it’s got a lot to do with the observability stuff that’s just dumping data on me that I’m not using as a developer. So, that was my first experience. The second thing is, I started digging into that. I spent a better part of a year on stage talking and warning people about this. I have a talk that I’ve repeated many times called “The Three Pitfalls Everyone Should Avoid With Cloud Data.” And that’s been so well received that you’re onto something, right? So then, I started thinking about what’s next after I’m going to move on to a new role. Observability stuff kind of popped up in my list of interviews, and they’ve been looking at what I’ve been talking about, and then asked me what I thought about it.

So I haven’t sat as an SRE on-call that much, but I’ve seen the impact of the results of these large, at-scale deployments in the cloud with customers that are fairly huge in my old role across the globe. And everybody’s struggling with the same thing. So, that’s kind of how it sucked me in. Because I was like: “Yeah, this seems like what I’m interested in. This is where my strategy and my discussions are taking me, and I’m really excited to get into new technology. It’s really, really fun.” And, where I’m at now, the founders, who came from Uber, did this at scale, have been through it all, and there’s so much to learn from these guys. So, it’s been a fantastic first year for me here at Chronosphere.

Steven: The sort of general topic we were thinking about discussing was around dashboarding. One of the things, this is maybe just my impression, but I get the sense that in the cloud native observability or modern observability community, I wouldn’t go so far as to say they’re [metrics] aligned, but there’s a weight towards distributed tracing is the way to go, and that is the key to modern observability. But hey, dashboards are metrics, right? So, I guess my question is, and this is more of a general starting point for discussion: What is the place of dashboards or of metrics in general in modern observability?

Eric: I think a little bit of what you’re touching on is the transition between what we’ve seen, where you have application performance monitoring kind of organizations and tooling – where metrics are your big focus, and you understand the environment: It’s not auto-scaling. It’s a very targeted environment, sort of what we could consider the second generation of observability to where we’re at now – where it’s large, born in the cloud companies that have done this at massive scale, or started at a smaller scale, but it explodes through the setup, through the Kubernetes auto-scaling. When they become successful, we saw in the pandemic, many companies just scale through the roof, right? The whole business pattern has changed and how we lived our lifestyles: DoorDashes, and things like that.

I don’t think the story should be just about technical aspects. People always like to talk about metrics, traces, logs, you gotta have that. What you are doing with this stuff is more important. How you’re approaching this from a business standpoint is how you’re getting the answers that you’re looking for in these dashboards. So, something that we talk a lot about here at Chronosphere is know, triage, and understand. And you notice all three of these words are business words. They’re not technologically focused. I like to explain this stuff from the open source world. People love to talk about the technology and the functionality, but when you try to sell that, you’re talking to the developer at the bottom of the pile. Those aren’t the decision makers at the top. There’s nothing wrong with that, but it’s a different aspect of how you’re approaching the problems that these businesses are dealing with.

So, when you look at the third generation of observability, we’re no longer interested in just tracking metrics, just tracking traces, just tracking logs. Fantastic. I can make a dashboard to dump a big old log in your face. Does that change anything? You know, it changes nothing. And so, what you’re trying to figure out is a way to have alerts trigger something to get somebody to a dashboard that starts the investigative process, and leads you down to the answer – sort of like good documentation does, which is one of the analogies I like to use a lot with dashboards. Good documentation for any project takes a lot of effort. It’s a constant thing. It’s improving all the time. It’s adjusting all the time. And the idea is, that you’re digging down through a bunch of docs to get to where your answer is. You’re not grabbing this pile of docs and then switching to this pile of docs and switching to that pile of docs. That’s what happens in a lot of those second generation organizations, where they’ve been going at it so long in a certain form with just metrics. There are all these dashboards all over the place that are just disjointed, with maybe a link between them.

I think you’ve had my colleague Paige on here, where she talks about when you’re on-call. She’s the youngest retired SRE, right? She’s burned out, done and doesn’t want to do it anymore. This last Christmas, she was sending me text messages like: “Hey, this is the first time I haven’t been on-call!” She was so happy. That’s because when you’re on call, when it goes off, you think: “How am I going to figure out, and dig through this stuff?” The idea is that you want to land on a starting point, which includes maybe documentation, a few simple things, maybe the alert message that got triggered, and then go down deeper and deeper to figure it out. When it’s done well, you don’t even have to have super rockstar, technologically advanced SREs. You can train younger ones to do this to a certain point.

That’s the ideal world, right? That’s nirvana: Being able to know that something is going on, being able to triage it really quickly and remediate within those two steps somewhere. Remediation doesn’t have to mean rolling back what you just did. It could also just be filtering out that metrics cardinality spike that you had happen – just stop it right there. Contact some people and find out what they want to do from that point on. You just saved the company a ton of money by not storing all these extra labels that have been generated.
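To make the “filter out that cardinality spike” idea concrete, here is a minimal Python sketch. It is an illustration only, not how Chronosphere or any particular product does it: the function names, the `max_values` threshold, and the shape of the data (each series as a plain dict of labels) are all assumptions made up for this example.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count the distinct values seen for each label across a batch of series.

    `series` is a list of label dicts, e.g. {"service": "orders", "pod": "a1"}.
    (Hypothetical shape, chosen for illustration.)
    """
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def drop_high_cardinality(series, max_values):
    """Strip any label whose distinct-value count exceeds `max_values`.

    Returns the filtered label dicts plus the set of labels that were dropped.
    A real pipeline would also have to merge the samples of series that
    become identical once a label is removed; that step is left out here.
    """
    cardinality = label_cardinality(series)
    noisy = {name for name, count in cardinality.items() if count > max_values}
    filtered = [
        {name: value for name, value in labels.items() if name not in noisy}
        for labels in series
    ]
    return filtered, noisy
```

Run against a batch where a `request_id` label has suddenly exploded, this would strip only that label and keep the stable ones such as `service`, which is the “stop it right there” step before anyone decides what to do long term.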

And then, being able to always go back and understand deeply what happened so you can fix it, adjust your dashboard approach or adjust some of the documentation for the next person that might encounter something like this. So, that’s kind of the vision with the know, triage and understand approach, that brings us a lot farther than just talking about: “Hey, metrics are outdated.” I don’t think it matters to anybody – when they get to their alert and they come into the office – if it was a metric, log, or a trace, or if it was one metric, three traces, and two lines in a log that got me the answers I needed. They really don’t care. I would rather have a unified experience that embeds this in some way, in some form in the dashboards – as I go through my documentation dive to figure out what’s going on.

Steven: So many things I’m hearing from that.

One of them is that, as you said, dashboards are a bit like documentation. Documentation is a source of technical debt in a way, because you’re creating something which has a use, but if you don’t keep maintaining it, it loses its value. I’ve certainly seen situations where dashboards just build up, get out of date, don’t even work anymore, and get left. And they’re just sitting there, wasting away. In the sort of modern era when things are moving fast and they’re really complex, are dashboards a thing that people can still create and craft manually to tell stories? Or are they more the kind of thing that we should be using automated scripts to generate based on templates?

Eric: I would love to have both, as long as they’re successful, right? I don’t think the journey is important. I think the end result is important. I spent some time in the military, and they used to always tell us that the shortest distance between two points is a straight line, right? So, quit trying to go around stuff. Go straight through it. And, I feel like you’re looking to get to the end results where you’re not burning out your developers, engineers, or your on-call team. Where you’re looking to get remediation done faster, where customer experiences are better. Where we have good insights into what’s going on when you can do these things. And the cost is not such an issue. The cost starts to be an issue when you’re unable to achieve these things. Then, people start scratching their heads: “Why am I paying more for observability data than I am for my actual production, customer data and stuff that’s getting me money?”

This needs to be balanced by this. And when it’s not, it’s full stop. And what the hell are we doing? I think auto generating stuff would be fantastic. If that works, great. Everybody’s promised that AI’s going to take care of all this, right? We can just plug in the on-call AI bot. Not that it will be predictive on AI or anything, but I think it’s something that’s going to evolve that might find niche cases where it’s going to work. I always like to refer back to the day when I first started at university doing a computer sciences degree. We had AI promising us a medical diagnosis, and everything’s going to be fantastic – no more mistakes by doctors, or at least 99% of the stuff was going to be covered by these AI things. It’s still not happening. You’re like: “Here it is again, coming back around. Now it’s our better search engine chatbot.” It seems kind of strange that it’s going to fix everything. So that part right there, like auto generating your dashboards, I don’t know anybody that’s going to put that much effort into it.

Steven: My experience with APM tooling that has AI stuff doing this and that: I’ve always found that it just generates a ton of garbage unless you consciously configure it in certain ways so that it understands your context. And then, at that point it’s like: “Well, how AI is this? I’m feeding so much into it anyway.”

I did performance testing for 13 years, we could automatically correlate dynamic values and blah, blah, blah. But it never, ever worked. And then, when something goes wrong, because it’s all automatically done, you have no idea how to fix it. It didn’t matter. Even in my 13th year, I always did things manually. I forgot about all that. I want to do it manually, because I want to understand what they are saying to each other every step of the way.

Eric: If I look back into the development days when I was coding and using an IDE, and when you deploy something, the most dynamic it got as far as services and stuff is that you’d have a service bus and you might have a lookup service where you can find the services you want to call. And I was like, you’re putting breakpoints. It’s very static, nothing like it is now. And then, you see people having to deal, as a developer, with telemetry data and: “What happened the last time I made this call out into the wild, and where did it go and why?” Looking at traces and spans, it’s no longer just an ops kind of thing.

I always kind of had this little hidden joke in all the talks I did around DevOps, where I seriously come from the dev side, right? That’s what I love. That’s why I signed up for it. That’s why I came to this role wherever I’m at, at that time. And then, we talk about DevOps, but why am I doing … And one of our research things that came out this last year talked to 500 engineers: 10 hours a week are spent on triaging and dealing with deployment problems. Do you think I, as a developer, signed up for that much? This is not making me happy. You wonder why they’re stressed, burned out, and they don’t want to do the job. If you wanted that done, then you should have got somebody that wanted to do that, right?

There are people that do admin stuff that prefer to do that, and don’t want to develop. Fair enough. There’s plenty of room in everybody’s world to do what you want to do, and why would you do something you don’t want to do in our business, when there are so many openings and so much shortage of resources, right?

That’s really rough, right? So, the dashboard thing, you see that creeping more and more into the developer space also because, when you have an observability team or centralized observability that can make stuff available for developers in certain environments – that basic knowledge of what’s happening with memory usage and things like that, it’s much easier than me trying to figure it out from my IDE, these days. Let’s be honest, how many plugins am I gonna put in my VS Code? It’s off the planet, what you can do with this stuff nowadays.

Steven: That’s interesting – what you’re saying about developers having to take on this new responsibility. I’ve never really thought about a lot of that before – because I’ve thought about it from an organizational perspective. If you have people who understand the context of the code that’s being written, surely they can have an easier time debugging things, as opposed to a separate isolated team. But then, I never really thought about the individuals and now: “Hey, you have to do all these extra skills and things that you may not like doing.” I’ll take that away with me.

Eric: It’s hilarious. I have a slide somewhere in one of my talks, and it comes out of one of the chat things online where somebody was asking: “Hey, OpenTelemetry and tracing, what is all this stuff? Is this valuable for me to integrate into my observability stack?” The first really dry, sarcastic reaction that came back was: “I’m in an organization of about a thousand developers. I can’t even get them to look at metrics, let alone tracing.” I was like: “Oh, it doesn’t matter how great your tooling is, it really doesn’t, and what it delivers if they’re not interested, you know what I mean?” If you can’t get anybody to pick up the tool, what the hell?

I like to point out: How many times have you seen somebody out in his garden doing some woodwork or whatever, and he’s nailing up a fence. What do you use on a nail – a hammer, right? But you see the guy out there with a wrench whacking a nail. Because that’s the closest thing he’s got at hand. I don’t care how great your hammer is. If it’s just outta reach, he’s not going to go get it.

Steven: For the vast majority of people who work in technology, they just work for regular organizations. They don’t work for Meta, Google, or Amazon or something like that. They don’t work for some unicorn startup either. They work for an insurance company, bank or a government department. Most of my experience in those organizations is that setting up distributed tracing at an organization level is nearly impossible. There’ll be patches of getting some pretty advanced stuff going, but that’s not the world. It’s very difficult when there’s so much history. Maybe the culture isn’t technology first.

Eric: I also think it’s getting us back to the dashboards and stuff. If you’re bringing all this into the organization, wouldn’t it be nice if it was just part of the integrated experience where: “Turns out this time around, two spans, half a trace and a log line was all I needed to figure it out.” The guy behind the screen doesn’t give a crap what you use to get there. So, integrating your traces into your organization in this fashion is a different story than saying: “Hey, go to my dashboard for OpenTelemetry, and start digging into the Jaeger things and look …” And there ends up being like three people in the org that actually use it, and get value out of it.

It’s really funny. In our product, we have the ability to look at some of the data that you’re ingesting, your metrics data. It doesn’t really matter where it’s coming from, but it’s all flowing in. You can see it live, but you can also flip a switch and it’ll show you everything that is not being used at all. So it sorts it from the least used to the most used metrics. It even lets you dive down into the labels. Even if you find something that is used, maybe half its labels aren’t being touched, so you can filter that stuff out before you persist. It saves you a lot of money, right? That’s the way most of the stuff works.
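The “least used to most used” view can be pictured with a small Python sketch. This is a sketch under stated assumptions, not the product’s actual implementation or API: the `MetricUsage` record, its fields, and both helper functions are hypothetical names invented here, and it assumes you already have per-metric and per-label query counts from somewhere.

```python
from dataclasses import dataclass, field


@dataclass
class MetricUsage:
    """Hypothetical usage record for one ingested metric."""
    name: str
    query_count: int  # how often dashboards, alerts, or people queried it
    label_queries: dict[str, int] = field(default_factory=dict)  # label -> times referenced


def rank_least_used(metrics: list[MetricUsage]) -> list[MetricUsage]:
    """Order metrics from least used to most used, so never-queried ones come first."""
    return sorted(metrics, key=lambda m: m.query_count)


def unused_labels(metric: MetricUsage) -> list[str]:
    """List the labels on a metric that nothing has ever referenced."""
    return sorted(label for label, count in metric.label_queries.items() if count == 0)
```

The metrics at the top of `rank_least_used` are candidates for not being stored at all, and `unused_labels` points at labels that could be filtered out of a metric that is otherwise worth keeping.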

What was really interesting is, we found a couple of examples where something was being used by two SREs, but these were the two most experienced and rockstar SREs. The guy that saw that during a pilot was like: “Hey, what’s going on with these two? Let’s go talk to them about why they think this is important. Maybe we need to include this for everybody else.” And it turned out it was an addition to a dashboard that could help a lot more people in the organization with what they were looking at. That’s the kind of stuff you want to find. That’s the kind of stuff you need to expose in your dashboards – not: “We have tracing, go to the tracing dashboard.”

It’s how it’s individually used and where the value is in the data you have coming in for your organization. That’s why, for a lot of the questions that you and I have talked about, leading up to this, it’s really hard to give a definitive answer because, is it a small organization you’re describing? Is it somebody that’s not that deep into the cloud native stuff? Is it somebody that’s full blown at scale? What’s your use case for that specific on-call person you’re trying to set up the alerts for? What area are we dealing with? What business metrics are important to you? What do you want to know, triage and understand in this environment? It’s always different for each one we get into.

Steven: I have this topic that I’ve titled “A Beautiful and Ugly Dashboard.”

I interviewed a guy called Jamie Allen this morning, also similarly about dashboards focusing on the whole idea of a single pane of glass. And, we were talking about how he worked with an organization once who had an entire wall of single panes of glass, which are just filled with widgets, like an entire wall of just TV screens back to back to back, which no one even looked at, but they looked beautiful. They looked stunning.

Eric: I was once in Boston, and I think that’s the headquarters of Akamai. Akamai runs like all the world’s internet. They’re huge, right? They have a glass area on the street where you can look into their centralized operations. I don’t know if that’s what it is or not, but that’s the impression they’re giving. And man, is that a wall of dashboards with screens. It looks like Command Central for NORAD. I don’t know what it’s about. I don’t know if they know what it’s about, but there’s a lot of stuff up on the wall, and a lot of numbers going by, and a lot of lines being connected.

Steven: Maybe someone from Akamai’s listening, and they can confirm for us what that’s all about. Is it just for show, or is it really being used?

Eric: I mean, it’s super impressive, which is probably what they’re trying to do. I’m sure some of that’s interesting when they need to dive into a specific one for whatever they’re doing. But, when you walk by something like that, you just kind of scratch your head … Good god, how did they do that? Pretty impressive.

So, what was his view on a single pane of glass? Was he all in on it?

Steven: His view was that it is useful, provided that, like you were saying, it helps you triage and it has to be simple enough that you basically know: “Does it have a problem? If it does, I need to better drill down to the next layer to at least understand which area the problem is in before we get into the more detailed analysis.” That was his view. And I completely agree with that.

Eric: I understand that when it’s a single pane of glass, you have one starting point. For me, the single pane of glass goes away the minute you can dive down deeper, right? So, now it’s no longer a single pane. You have multiple panes you’re diving down into. I feel like it’s a cop out to say that you have a single pane of glass, because nobody can achieve that unless you have a dashboard that’s like 50 foot by 60 foot and has everything. It’s flattening it out into a single pane, the idea of not going any deeper.

Steven: I think of it more as, if you can have a single view, which can answer the big questions that you need to know about something. I think it comes down to context, right? If I have a very simple service, can my customers use my service, and are they having a good time? If I can answer that question in one place, I will call that a single pane of glass.

Another sort of cliche in our dashboards is, I’ve been in offices where you’ve got these TV screens with dashboards showing the health of infrastructure and applications and services and so on. A lot of people work from home now. And I was wondering, as a person working from home, when do you go and look at dashboards? Do you have to build a routine to go and check up on them? Or is it triggered by an alert that brings you to a dashboard? Is that sort of the model that you see?

Eric: I think it’s more so that. I mean, there’s always running dashboards. For example, a really famous one is that they have a bunch of boxes at the top that might represent business areas of services or whatever, like an order service and whatever that they’re doing. Are they green, are they red, are they yellow? What do they look like? What does that mean? That has to mean something to you, but I assume green is good, red is bad, you know, that kind of thing. So, that’s just one you’d browse through, right? But, how’s the business doing right now? That’s got nothing to do with where you’re going to start your triaging thing other than that maybe you’re looking for a red one. But generally speaking, alerts are tied to bringing you into an area of the business that you’re covering with that on-call service you’re doing. There should be a starting point that narrows it down much more than looking at the whole business. You’re trying to save some steps … You get launched into a little bit narrower view. And, we definitely try to do that – quantifying it down to something that then you drill down into directly, and it unifies the metrics, the traces and all that.

Steven: You introduced me to something I’d never heard of before called the Perses project, which is a project about dashboarding. I probably should have known about it, considering I work for a dashboarding vendor. I was curious, what is the Perses project?

Eric: That is something that got kicked off right after Grafana changed their licensing not too long ago … It was the open source standard of what you would go use as a dashboarding target point. So, everybody had it everywhere. Being true open source, just like a lot of the stuff I encountered while I was working at Red Hat, people embed open source everywhere. They use the tools where they want to use them. They moved away from the Apache license, so that if you embed it, you have to contribute any changes you make back to the community. And that kind of makes it a no-go for a lot of vendors, of course. So people were looking for some way to get to a native dashboarding framework.

I do know, and I built a workshop around it just because I was fascinated by what was going on with this, it was really early days, so it was not so much drag and drop, it was more doing the actual YAML or JSON code to get your dashboard together. It’s a bit far-fetched to expect people to do that, but it was such early days that that’s the condition it was in. You saw really quickly, places like Amadeus, even Red Hat, people are involved. So, you see there’s a real coalition around interest in this space. And plus, people have been working with dashboards so long they think: “Hey, we can do it better this time, right?” As you do in open source. So, you end up with a cleaner, more performant kind of framework, I think is where it’s headed.

We’re not deeply involved in the sense of leading the project or anything like that. But we have people that are active in that space. You see them on the channels of the community. They use Matrix, sort of a web version of IRC from back in the day. It’s a chat about stuff. I find it utterly fascinating. It’s a really clean looking dashboard and it’s a dashboard, right? … It’s not about the project you’re using to build the dashboard. It’s about what you’re doing to make your dashboards effective for your people.

I don’t get too deep into what you’re using to do your dashboards. It’s all good for me. I have had a heavy bias for open source all through my career since I started. I’ve been involved in lots of projects in lots of different places and contributed and wrote books about the stuff. I like open source projects, and it’s really fun to get hooked into one so early and see it grow. And, rumor has it, by the end of this year, they are going to launch their project officially, probably through PromCon EU. I’ve heard rumors that that’s what they’re trying to do. So, I’m going to update my workshop, make sure it’s available. It’s like a supporting piece for them.

Steven: Is the project something that practitioners can use too, for their own work? Or is it more the kind of thing which vendors would use to embed dashboarding in their products, or both?

Eric: I mean, anybody can use it for whatever you want, and there’s no limitations because it’s back to an open source license.

Steven: That sounds cool. And it sounds like it’s not just a tool, it’s an open standard for dashboard visualization.

Eric: A core dashboard – the overarching thing. I believe it’s been included in the Linux Foundation to start with. There was talk of moving into CNCF maybe one day. I mean, it’s such early days. That’s all kind of speculation right now. First you need to build something and they’ll come.

Steven: I did have one question for you, as someone coming from this sort of development world. Have you got any advice for anyone who’s currently working as a developer, but is thinking about, or is maybe being sort of pushed into a more operations type role or mixed DevOps role?

Eric: Run. No, I’m just kidding. I think you kind of have to embrace it these days. My generic advice about any role you have, anything you’re doing in your job and in your professional life, is that if you come into the role new, you spend probably 99% of your time doing what they’re asking you to do, trying to figure out what’s going on, trying to embed yourself into the activities that are going on in your environment. Fair enough. The whole goal is to narrow this down to 90%. Then you have 10% playroom to figure out what you like, and to invest into what you like within the role – maybe spend more time on development than on ops, or maybe spend more time on Kubernetes than on Java language, or on certifications or whatever it happens to be.

Then, pretty soon, you get that down as you get more and more senior. That playroom gets bigger because you’re more effective. You’re able to produce more than enough work to cover what is expected of you, and you start being able to have hobbies and to improve what’s around you – like maybe invest in pipelines, or making things automated, or make it easier for new developers coming in, or spend some time on documentation or dashboards. So, I would say my advice is: If you can’t get to a point where you’re doing maybe 70-30, I don’t think you’re ever going to be happy where you’re at. I think all of us do this in some form or another. I mean, are you having fun doing these podcasts? I’m sure this is kind of a thing you carved out in the role.

It wasn’t something they described from day one. That’s what happens. You start making fun things part of what your work is. And, if you can’t find that, we spend so much time at work in our lives, then you probably need to find a different role, right? If it’s really not your thing to do that kind of operation stuff, then maybe you don’t want to do that – go find a silly developer role in the corner, where you can just code. If that’s really what makes you happy, then go do that. I have friends that do that. They absolutely love that. They’re experts in their languages. So be it. It’s whatever makes you happy and gets you with a big smile behind your computer every day.

You’re not going to function well if you’re not finding that happiness in your work. So, that would be my biggest thing. After all the years I’ve done, I constantly strive to find hobby projects within my work. And they tend to align with work in some form. It’s not that I’m just off Red Sox writing while I’m working at work, that’s not what I’m talking about. I’m talking about developing workshops, or writing about architectures that are on the edge of what we’re doing, but one day are going to be relevant. It just interests me. It’s my interest. So, if you can’t expand your own horizons, what the heck? Maybe it’s time to find a new place.

Steven: I think that’s good advice. This podcast was a hobby of mine. I did it in my spare time on the weekends. Someone saw it and said: “Hey, that’s exactly what we want for our business.” And here I am doing it, there you go.

Thank you so much, Eric, for coming on the show and talking about dashboards and everything else we talked about. That’s all – from another episode of Slight Reliability, and I’ll see you all next week.