Senior Developer Advocate
Paige Cruz is a Senior Developer Advocate at Chronosphere passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to Site Reliability Engineering, holding the pager for InVision, Lightstep, and Weedmaps. Off the clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.
In my sordid career, I have been an actor, bug exterminator and wild-animal remover (nothing crazy like pumas or wildebeests. Just skunks, snakes, and raccoons.), electrician, carpenter, stage-combat instructor, ASL interpreter, and Sunday school teacher. Oh, yeah, I’ve also worked with computers.
While my first keyboard was an IBM Selectric, and my first digital experience was on an Atari 400, my professional work in tech started in 1989 (when you got Windows 286 for free on twelve 5¼″ floppy disks when you bought Excel 1.0). Since then I’ve worked as a classroom instructor, courseware designer, helpdesk operator, desktop support staff, sysadmin, network engineer, and software distribution technician.
Then, about 25 years ago, I got involved with monitoring. I’ve worked with a wide range of tools: Tivoli, BMC, OpenView, janky perl scripts, Nagios, SolarWinds, DOS batch files, Zabbix, Grafana, New Relic, and other assorted nightmare fuel. I’ve designed solutions for companies that were modest (~10 systems), significant (5,000 systems), and ludicrous (250,000 systems). In that time, I’ve learned a lot about monitoring and observability in all its many and splendid forms.
Paige talks with Leon Adato about “the network” being an easy target to blame for system issues and how networking is in fact a shared responsibility between Dev and Ops. Leon shares unique experiences from 18 years holding the pager on-call. Part 1 wraps up with a reflection on learning your first monitoring tool (for Leon this was Tivoli) and how user expectations and these tools have evolved.
[00:00:00] Paige Cruz: Hi, Paige Cruz here, and I’m so excited to bring you the very first episode of Off-Call, featuring Leon Adato. We ended up covering so much ground, with too many gems I couldn’t bear to edit out, that I’ve decided to split our conversation across three episodes. You’re listening to part one, where we delve into the benefits of knowing just enough networking as a developer and Leon’s experiences managing 18 years of on-call, including his worst and most surprising pages, and wrap up with reflections on the ins and outs of learning your first monitoring tool. Enjoy!
[00:00:49] Paige Cruz: Welcome to Off-Call, the podcast where we meet the people behind the pagers and learn a little bit about monitoring along the way. Today, I am absolutely delighted to be joined by Leon Adato, a fellow ex-Relic like myself and a whiz at networking up and down the stack, and today we’re gonna get into talking about “it’s the network.”
It’s a common thing you hear developers lob as reasons for glitches and things going wrong in the system. And today we’re gonna unpack what is going on with the network, and is it always the network’s fault?
Never!
[00:01:30] Leon Adato: It’s never the network!
It’s always DNS. Sometimes. Okay, it’s never the network except when it is.
And I think therein lies…
[00:01:39] Paige Cruz: We’ll accept that.
[00:01:40] Leon Adato: Therein lies the rub, which is that for a lot of folks, especially DevOps folks who are on the dev-y side rather than the ops-y side, the network is this black box. They’re not sure what’s going on, and therefore: I’ve checked all my stuff, so it must be that.
Maybe not, and that’s really the crux of our conversation.
[00:02:00] Paige Cruz: Absolutely. And we still do have network engineers. Many a company I’ve worked at has had a whole networking team, so it’s easy to say, we’ll let them look into it. It’s got to be a network thing, that’s what they’re in charge of, right? But I suspect it’s a bit of a shared responsibility, something we on the dev side should care about as well.
[00:02:22] Leon Adato: Yeah. Everything in tech is a shared responsibility in the sense that yes, there is such a thing as MTTI, right? Mean Time to Innocence. What, me? Ha! We want to keep that very low, but if you stop there and you really just walk away, you’re a jerk. There’s still a problem, and you may not have to take responsibility or ownership for it, but you can be part of the problem-solving process.
And also I think anyone who’s been in tech 15 minutes knows that a lot of problems are multi-element. Yes, you’re right, these three things weren’t the problem, but these other two things are being masked by the network, or the database, or DNS, or whatever it is. Once you resolve those things, the other two things that are your responsibility now show up clearly, and you still need to be involved.
[00:03:16] Paige Cruz: Absolutely. I am a big fan of sharing operational responsibilities, including investigations. And that is a little bit of why I wanted to bring this podcast out into the world. I asked myself a lot, does the world need another podcast? My answer to that was no, for a very long time. Until I reflected back on the privilege that it is that you and I have both worked for monitoring and observability companies.
And that gives us a pretty unique perspective in seeing behind the scenes, the backstage of how you render a metric chart over five years or three years, and what that means for every individual data point. There’s a lot that you learn about how to do things better and how to interpret your data when you’re responsible for building and operating a monitoring system.
That was a bit of why I wanted to bring you on. We can give developers the tools to say, I confidently know it’s not the network and here is the data that I used to come to that conclusion. I think your ops and network team would be a lot happier if you came armed with facts, having done a mini investigation.
[00:04:23] Leon Adato: Absolutely, and to turn it around, a lot of us have had to work help desk, and we’ve all gotten that call, “The internet is down,” or whatever it is, and it’s clearly not. But what if a user had called us and said, “Hey, server number 7 seems to be having 15 full seconds of latency on it. And I think it’s related to interface 5. And also I think that one of the function API calls is running a little slow.” If they had said that, would you say, “Why are you hacking my systems?” No, you would say, “Would you like a spot on my team?”
[00:04:57] Paige Cruz: Yeah.
[00:04:59] Leon Adato: that for each other.
We on the network team should be that for the database folks. We on the application development team should be that for the network team. We should all have an interest or a curiosity in how the piece that we’re responsible for fits into the larger ecosystem.
[00:05:18] Paige Cruz: Absolutely. So we’re going to talk a little bit about your personal time being on the pager, on-call. Then we are going to flip back into what data we have about the network and what it actually means, because I do think we suffer from an abundance of data.
And data is not necessarily information. Data is just data. You need a bit of analysis and a bit of context around that. And how do we make sense of this, whatever bits per second going across links? What does that all actually mean for us and our end users? Thinking back to your time on-call, and just to confirm, you were not on-call as of this recording.
You have left the life of pagers behind.
[00:05:59] Leon Adato: It feels so good.
[00:06:01] Paige Cruz: Yeah, it is. The water is warm over here. Thinking back to the times that you were on-call, I have to say, it’s not always the most convenient responsibility to have.
Do you have memories of a standout time that you’ve been paged and it’s just been inconvenient or silly or strange?
[00:06:20] Leon Adato: Yeah, okay, so I will mention that there is a top slot of worst pager call ever that I will not be sharing today, and I will allow your audience to let their imaginations roam wild and free as to what could possibly have happened to Leon that he wouldn’t even be able to talk about it, because I actually have no filter and I will talk about almost everything.
Second place goes to a boss who really loved testing whether I meant it when I said I was an Orthodox Jew and that I really wasn’t available on Shabbat, and would find increasingly obnoxious reasons to page me out at those hours to see if, finally, they would say the thing that would cause me to come online so they could say, “Ha!
I knew you were just slacking off!” And… nope.
[00:07:08] Paige Cruz: Gotcha. It’s all been a front. What a strange front to have. No.
[00:07:14] Leon Adato: No, I would routinely turn the pager on and, by the way, someone else was covering on-call. It wasn’t like on-call was going uncovered. I had arranged for somebody.
[00:07:23] Paige Cruz: You had a plan.
[00:07:24] Leon Adato: The rest of the team said, “Yeah, he said blah, blah, blah, blah, blah. And, oh, look, the data center’s on fire.” Literally, that was one of the messages. That was obnoxious; please don’t be like that.
[00:07:37] Paige Cruz: Yeah, this is what not to do, manager.
[00:07:39] Leon Adato: Really. The one that really comes to mind, when you’re talking about festive, was the 2 a.m. call, and I got this more than once, because there were snakes in an attic.
[00:07:48] Paige Cruz: Huh.
[00:07:48] Leon Adato: Now, for context, I was not working in tech, I was actually working for a pest control company that specialized in wild animal removal, so this was a normal call. But at 2 a.m., bleary-eyed and bushy-faced, about 20 years old, I was crawling up into somebody’s attic, into the insulation, to go look for a snake that was also crawling through the insulation.
The good part about it is I live in Cleveland, Ohio. We do not really have a lot of, or any, poisonous snakes that are indigenous, so I wasn’t worried about that necessarily, or that it even coming through their ceiling was an option. It was really just a matter of finding and then trapping the snake, who really probably didn’t want to be there any more than I did.
[00:08:33] Paige Cruz: Totally! Something related to on-call is asking, is the thing I’m being paged for an actual emergency? And I have to suspect that snake had been having the time of their life up in the attic. What was the rush? Why could this not have waited till 8 a.m.? Till 10 a.m.?
[00:08:48] Leon Adato: I struggle with that one because if someone said there’s a snake in the attic to me, I would be hard pressed to feel like, oh that can wait.
[00:08:57] Paige Cruz: Okay. Okay.
[00:08:58] Leon Adato: Who among us hasn’t watched Snakes on a Plane? Who among us hasn’t worried about… I can understand.
[00:09:06] Paige Cruz: If you know their location, you want to get them out before it has time to hide.
[00:09:10] Leon Adato: Before it has time to go somewhere else, like the bathroom or whatever. Yeah, I can, I sympathize.
[00:09:15] Paige Cruz: Interesting.
Okay. Wow. The 3 a.m. page, it’s almost like a joke at this point that, okay, the pager wakes you up and it’s 3 a.m. And sometimes it really does happen. It does. But I’ve never been paged for a snake. I have a feeling you might take the top prize for best worst page.
[00:09:33] Leon Adato: I hope I do. I hope that is honestly the worst thing anyone says, because obviously there are worse. And I hope that this is the absolute worst one, you and all of your guests ever have had to deal with.
[00:09:45] Paige Cruz: I’ll reach out at the end of the year with the Pagies, the awards that I just made up.
[00:09:51] Paige Cruz: I have a hypothesis that a lot of us build our understanding of monitoring and observability and system health really through the lens of the first tool we were introduced to. Because everything after that, you compare to it. How is it like New Relic? How is it like Prometheus?
Do you have fond memories of the first monitoring tool you used? What was it?
[00:10:11] Leon Adato: That’s such a loaded question, do I have fond memories of it? Ignoring ping and traceroute, which I think are a lot of people’s first experience with the concept of monitoring: how do I know something is there, or up, or not?
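For anyone who has never poked at that layer, here is a minimal sketch in Python of the “is it there, is it up, and how long does it take to answer?” check Leon is describing. It is not anything from the episode: it stands in for a real ICMP ping with a plain TCP connect (ICMP usually requires elevated privileges), and the hostname and port are placeholder values.

```python
# Minimal "is it up?" reachability check, assuming a TCP connect is an
# acceptable stand-in for ping. Host and port below are placeholders.
import socket
import time
from typing import Tuple

def is_reachable(host: str, port: int = 443, timeout: float = 2.0) -> Tuple[bool, float]:
    """Attempt a TCP connection; return (reachable, elapsed milliseconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.monotonic() - start) * 1000
    except OSError:
        return False, (time.monotonic() - start) * 1000

if __name__ == "__main__":
    up, latency_ms = is_reachable("example.com")
    print(f"example.com is {'up' if up else 'down'} ({latency_ms:.1f} ms)")
```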
The first one I ever used, and I will be honest, okay, the gray on my hair is earned, I’m 57. I’ve been in tech for over 35 years now. I started in IT when Windows came for free on twelve 5¼-inch floppies.
[00:10:40] Paige Cruz: The floppy era?!
[00:10:41] Leon Adato: The hard disk, floppy disks. My first monitoring tool was Tivoli, which, when I started using it, had already been acquired by IBM, which made things a little bit interesting because what people don’t recognize is that Tivoli is actually a drinking company with a small software problem.
[00:10:59] Paige Cruz: When you go to Planet Tivoli, which was their convention at the time, the bar opened at 10. Oh my.
[00:11:07] Leon Adato: There’s a whole other story going on with that. Tivoli is really basically 15 Perl scripts in a trench coat and an agent, like a super agent that could do anything.
And that was it. So if you wanted to do something else, you had to code it. You either had to take the existing scripts and modify them or whatever. So do I have fond memories of that? I like Perl. And so I really enjoyed learning how to become a better Perl developer. Many people who are developers today are probably feeling waves of nausea coming over them as I actually say the word Perl and developer in the same sentence.
I apologize for that, but I had a good time.
The interesting thing about that first tool you use is that you have nothing to compare it to, and so it seems pretty normal. The most outrageously impressive workflows seem totally…
[00:11:54] Paige Cruz: Standard. Business as usual.
[00:11:56] Leon Adato: An example I use is that the first editor I ever used was Vi. I was 16. It was used as the default editor on a bulletin board system I was on. And so it was that, or X, or a franken-app called Fred. And so I picked Vi, and :wq!, of course, that’s how you save and get out. That makes perfect sense… write, quit, and out.
[00:12:20] Paige Cruz: So intuitive. Absolutely intuitive.
[00:12:22] Leon Adato: We really have no basis for comparison, and with Tivoli it’s, okay, this is how monitoring and software distribution and inventory work, this is how it works. It wasn’t until later that I could look back on it from other tools and say, “Hey, that was an absolute horror show. Terrible.”
But it really was top of its class.
[00:12:43] Paige Cruz: I think that’s what we forget when we look back on the past. You have to take into account the context of what else was going on and where tech was. And it has not always been cloud, containers, microservice sprawl everywhere.
We really built all this up from what sounded like simpler times.
[00:13:01] Leon Adato: Yeah, it was a simplistic time, if nothing else. It was top of class, it was the best you could buy, and it cost, no exaggeration, a million dollars.
[00:13:11] Paige Cruz: So, monitoring being expensive is not new?
[00:13:13] Leon Adato: Oh, no.
Part of that was IBM just wanting a million dollars for things. And you could only run it on AIX servers, because of course you could.
[00:13:21] Paige Cruz: May I ask what AIX is?
[00:13:23] Leon Adato: It was IBM’s custom version of Unix that ran exclusively on IBM hardware. I mean the million dollars was for stuff —
[00:13:34] Paige Cruz: So it’s a whole ecosystem. Vendor lock-in, this is what I’m hearing, is not new.
[00:13:38] Leon Adato: No. In fact, we’ve gotten so far away from it over time that what we have now is minimal compared to back then. If you were on Sun SPARC, you were running on Sun systems, and they could charge anything they wanted, through to the days of CA and Unicenter and stuff.
And I know this is off track, but yeah, lock in today is nothing like lock in yesterday.
Going back to the original question: Tivoli, which I was responsible for maintaining, having never used it before. The agreement was, “Leon, we know you don’t know this. So we’re going to let you learn on our time and our dime, and you are going to do your best, and you’re going to fix it, and you’re not going to rage quit in the middle.”
It was a nice little partnership, we will forgive the mistakes you make because you’re going to keep on trying, because the company has nobody else.
[00:14:36] Paige Cruz: They gave you that space to learn. What I love about that, and what’s different from a lot of situations devs find themselves in today, is that now it’s: yeah, you can learn on the job, but we’re gonna fill your time and make your sprints so jam-packed that you just fit that learning in wherever you can, it’s fine.
And you’re lucky to even get a learning budget these days. So I would encourage companies and managers listening to take some notes: give people space and time to learn and, importantly, say, yes, mistakes will happen, we expect that, and that’s okay, because in the end we will all have a better solution.
That’s the approach. Instead, we really just throw people into monitoring. We hand them the pager after three months. We say, you’ve got to be onboarded by now; surely you know the ins and outs of incident response and our data. Good luck. Have fun.
[00:15:24] Leon Adato: OK, so cautionary tale, the thing I just explained, that they gave me space, was hard won.
A year before that, they had brought in consultants to install Tivoli, again, a million dollars, plus another half million dollars for the consultancy itself. They had it all installed, set up, and then the leadership bought into the lie that this is self-maintaining. You set it and forget it. It’s just going to manage itself. And you just need teams to do the operations.
So there was a software distribution team. There was a distributed monitoring team. There was an event correlation team. Separate teams, remember, using the same framework. And so one team would go in and change the framework-level settings, which would completely bork two other teams.
Sure, not knowing it. And so you ended up, three months into it, with these inter-nested team wars of somebody changing the framework and not telling the other person because they needed to get this done, and hoping the other team didn’t, blah blah blah. The entire system was collapsing under the weight of politics and lack of awareness, to the point where they either had to write off a million and a half dollars and go find something else, and there wasn’t anything else, or they could take myself and two other people and say, look, you’re just gonna be in charge of the care and feeding of the Tivoli system, because it’s obviously not set it and forget it.
We don’t want to bring another half million dollars of consultants in to fix it. Good luck. It was really nice that they gave us the space to learn. But they had already been to the bad place.
[00:16:55] Paige Cruz: They felt the pain.
[00:16:56] Leon Adato: Yeah, and so managers, hopefully you don’t have to piss on the electric fence first to know that it’s a bad idea.
There are already these cautionary tales out there. Give people space to learn. Give people space to try and make mistakes. Maybe invest in a demo environment or a QA environment, or whatever it is. I know it seems like it’s a little bit extra for licensing, but it’s actually not in the long run, etc.
It’s really the right way to go.
[00:17:23] Paige Cruz: A playground. Oh man, I almost love the cyclical nature of the trends, or just when I hear stories of people starting out their career, I’m like, oh my gosh, I have been there. I have been the person that had to get put on the team because the monitoring system was falling over.
And in a strange way, the suffering kind of bonds us. I feel like I’m joining a long line of operators who have been there and done that. We’re not alone.
[00:17:50] Leon Adato: A long and august tradition of failure and pain. Welcome. Yeah, welcome to the family.
[00:17:59] Paige Cruz: So we’re not alone, and things, I assume, are getting better.
I personally had to retire from SRE. I was very burnt out from a few back-to-back stints at startups. Startups are a great place to get your hands on a lot of technology, often pretty cutting-edge, but the broad level of responsibility while maintaining on-call on a small team just really got to me.
I was on-call for six years. How many years were you on-call and how did you manage that? Because I had to tap out. I’m very honest about that. It’s not for me anymore.
[00:18:35] Leon Adato: Yeah, so 18 years altogether, but I’m including the two years of pest control. So 16 years of technical on-call rotations as part of desktop support teams, sysadmin teams, and monitoring engineering teams.
My on-call rotation ended in 2014, when I pivoted to start working for vendors and doing, whether we call it developer relations, advocacy, or technical evangelism, or, like, spokesmodel.
[00:19:04] Paige Cruz: Oh, I like that one.
[00:19:05] Leon Adato: It was 16 years and part of the reason why we didn’t feel burnout was, again, it was a simpler time.
Again, 2014, when DevOps was just starting, and cloud, really, Amazon had, I think, S3 and containers in 2014, but they hadn’t become the thing that they are today. So most of the work was on premises. We were still very much in the pets, not cattle concept, and I’m not making this up, we had servers named Hugin and Munin, the two ravens, because they were all subsystems to Odin, the main server.
We had routers that were named after the dwarves in Snow White. We knew that Odie and Garfield were the two VAX systems, and that if Odie was down, Garfield would take over. Or whatever we had named our pets, we knew them on a relatively personal, not intuitive but deep, level.
So when there was a failure it was a known failure. It was simpler in the sense that there weren’t microservices. There were no API calls. So when something failed, the troubleshooting process was fairly well known and well documented because it followed a standard routine. You either work from the back of the box out to the wall or from the wall into the back of the box.
Those kinds of things. So on-call rotations were much more predictable in what might happen. And that probably kept a lot of the burnout from happening, because just the pace of what could possibly be wrong was very different. That didn’t make it any less stressful when I was in hour 46 of 48 hours staying awake because email was down.
There were still those kinds of things. For a very long time, at every company, there were a lot of applications and three religions. One religion was email, and then there were two other ones that depended, like it might be order entry, it might be customer relations, it might be whatever, but there were three religions that could never, ever be down, and then there were a bunch of other applications that you had to keep running also. And email was down for our company for 48 hours while we tried to bring it back from the dead.
[00:21:24] Paige Cruz: Whoa, okay, simpler times, but incidents have always been stressful. Technology finds ways, and these systems find ways to throw curveballs, and it sounds like there was maybe an implicit agreement or expectation that email was 100 percent available, would you say? It sounded like there was no tolerance from the users for downtime.
[00:21:48] Leon Adato: Yeah, absolutely. Again, I won’t say there was no internet, but internet was email and a little bit of web browsing. Online order entry wasn’t really a thing for some parts of that, again, that 16 years, which ended in 2014. For a long part of it, the internet wasn’t a thing at all, in that it was just a cute little novelty. You had one internet gateway and people would walk out to it to go check AOL or something.
Yeah, it was much less critical than it is now; obviously, it wasn’t ubiquitous the way it is today.
[00:22:20] Paige Cruz: Well, that wraps up our first episode, from getting paged about snakes in an attic to organizations learning that monitoring stacks don’t just manage themselves, and so much more. I really, really hope that you enjoyed it. Stay tuned for part two, coming soon, where we delve into how monitoring relates to the three things businesses and your CFO care about, ways Leon unwinds off-call, and getting your bearings with SLOs.
Thank you so very much for listening to Off-Call, and a big thank you to our sponsor Chronosphere, the one and only observability platform that gives you complete control so that you can focus on the data that matters, remediate faster, and optimize costs. Check us out on the web at www.chronosphere.io and check the show notes for the spelling on that.
Cheers!