Senior Developer Advocate | Chronosphere
Paige Cruz is a Senior Developer Advocate at Chronosphere, passionate about cultivating sustainable on-call practices and bringing folks to their aha moment with observability. She started as a software engineer at New Relic before switching to Site Reliability Engineering, holding the pager for InVision, Lightstep, and Weedmaps. Off the clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.
Staff SRE | Cockroach Labs
Matthew Sanabria is an engineering leader focused on building reliable, scalable, and observable systems. Matthew is known for using his breadth and depth of experience to add value in minimal context situations and help great people become great engineers through mentoring. Matthew serves the Go community as a member of GoBridge and a contributor to the Exercism Go track. In his spare time Matthew spends time with his family, helps grow his wife’s chocolate business, works on home improvement projects, and reads technical resources to learn and tinker.
Paige chats with Matthew Sanabria, Staff SRE at Cockroach Labs, about memorable moments from his 11 years on-call, the warning signs of burnout, and what developers and operators can do for each other to improve database observability for everyone!
Paige Cruz: And welcome back to Off-Call, the podcast where we meet the people behind the pagers. Today I’m delighted to be joined by Matthew Sanabria, a Staff SRE at Cockroach Labs. We’ll talk about Matthew’s life on-call, his life off-call, and some tips and tricks he’s picked up along the way.
Paige Cruz: To kick things off, can you think back to the first monitoring tool you used and how do you feel about it today?
Matthew Sanabria: That’s a great question. The first monitoring tool I used was Nagios. Not “Na-gee-ous”… “Nagi-oss”? I dunno how people say it. It was basically a tool that let you write checks that responded in red, yellow, or green. Green being success, yellow being warning, red being, like, alert. And you would write those checks and that’s how you would get alerts. It was very… interesting.
Paige Cruz: So was it more than a health check or just that ping level functionality?
Matthew Sanabria: You could write any Python check, I think, at the time. So you could just write an arbitrary Python script, and the script would return basically zero or non-zero. I think a negative return would have been red or something. I forget the exact semantics, but you could run any arbitrary script.
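(For the record, the Nagios plugin convention signals status through a script’s exit code: 0 for OK/green, 1 for WARNING/yellow, 2 for CRITICAL/red. Here’s a minimal sketch of such a check in Python; the disk-usage thresholds are made up for illustration.)

```python
#!/usr/bin/env python3
"""A Nagios-style check: the exit code carries the status."""
import shutil
import sys

WARN_PCT = 80   # illustrative thresholds; real checks take these as arguments
CRIT_PCT = 90

def main() -> int:
    usage = shutil.disk_usage("/")
    pct = usage.used / usage.total * 100
    if pct >= CRIT_PCT:
        print(f"CRITICAL - disk usage {pct:.1f}%")
        return 2  # red
    if pct >= WARN_PCT:
        print(f"WARNING - disk usage {pct:.1f}%")
        return 1  # yellow
    print(f"OK - disk usage {pct:.1f}%")
    return 0      # green

if __name__ == "__main__":
    sys.exit(main())
```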
Paige Cruz: That’s pretty flexible! Okay. If you had to summarize- why do people have such horrible feelings and associations with Nagios? Cause on the face of it, customizable scripting for your monitors, red, yellow, green, what more could you want?
Matthew Sanabria: Just look at the Nagios UI, you’d be like, what is going on?
Paige Cruz: Some interesting choices were made on the front end.
Matthew Sanabria: It was just very rudimentary. They’ve obviously improved since then, but it was just a raw HTML page of links to checks, with a green, yellow, or red sign next to them saying OK, warning, or critical.
I was just like, this is great.
Paige Cruz: Okay, like my first app project, my first HTML.
Matthew Sanabria: Exactly. Yeah. No CSS. Just, yeah, exactly.
Paige Cruz: Love it. We have come quite a long way and thinking back to your life on-call, I believe you’re still on-call. How many years has it been since you’ve been on the pager?
Matthew Sanabria: Yeah, I actually covered the pager for my coworker this morning. That’s fun. I’ve been on-call since about 2015. That’s when I first started being on-call, and I remember joining the company and they’re like, “you’re going to participate in an on-call rotation.” And I was like, cool. What does that even mean? I don’t know what that means.
Paige Cruz: Oh, it’s quite the responsibility once you dig into it. Do you remember that first shift that you did?
Matthew Sanabria: I don’t remember the first shift because I think that I was being shadowed and guided. So I don’t remember it being particularly bad.
There was one year though, in a different job, where I was on-call once every three weeks. Technically once every two weeks, cause the third person was in Europe and wasn’t available during the time. That was miserable. For a whole year I was on-call every two weeks; it was horrible.
Paige Cruz: That’s a recipe for burnout. Even if it was the world’s most reliable system, the toll it takes to have to be alert just in case something happens is a heavy one to carry.
Matthew Sanabria: And you can’t plan your life well when your rotations come around that often. You just can’t plan your life.
I remember I was at a movie and I got paged. I was like, oh, I’m on-call? Why? I was literally just on-call. What happened? Yeah, it got to the point where it was not fun.
Paige Cruz: Not at all. And with those small rotations, people say, “oh, just get an override.” And you’re like… everybody on my team is tired of being on-call.
They plan for those two weeks that they have to themselves. They jam-pack those. So no, folks aren’t always available to just pick up a shift, even if you really need it. I’ve been there when there’s crickets in the Slack room, when I’m like, “hey, just need three hours.” “Okay… one hour?” It’s not fun.
I think you need a minimum of six folks in a rotation, but yeah, what’s been your experience?
Matthew Sanabria: At that time we had 10 engineers on the team, but only three of us were on-call, so that was kind of a management failure in that regard. Cause none of those other people were new enough to be excused from on-call; they’d all been at the company at least six months.
Paige Cruz: A bit of an anti-pattern.
Matthew Sanabria: Definitely an anti-pattern. I think five or six people is the minimum you’d need, right? Being on-call just under once a month, that’s about where the threshold is. So the gap between your shifts should be over a month.
Paige Cruz: Oh, totally. Speaking of getting paged at the movies, do you have any standout times or places that you have been paged? My infamous story was getting paged at a concert hall where an orchestra was playing along with Coraline the movie.
Matthew Sanabria: Oh, nice. That’s a good movie.
Paige Cruz: People at the theater are not into getting interrupted by cell phones, I’ll tell you that.
Matthew Sanabria: Oh yeah, I know. Theaters are legit. I went to go watch Lord of the Rings with the live orchestra, and people don’t want you moving. They want to be in the moment, and the last thing they want is a phone alarm going off.
Paige Cruz: Oh yeah, I felt every single person’s eyes on me as I was running down the aisle, because you can’t silence PagerDuty, right? The whole point is that it overrides all of those settings. So I’m running with my phone just going off with whatever I had set up. And I was secondary, which is why I went. I want to say I was very responsible, but on that rotation the secondary rarely got called in. So I felt, okay, I could take these three hours, and… no!
Matthew Sanabria: That’s a good point. The secondary rotation is an interesting culture. Cause there’s an unwritten rule of don’t page your secondary, right? Make sure it doesn’t get to the secondary. But then what’s the point of the secondary? Just to have redundancy? There’s a whole weird subculture there that we could get into.
Paige Cruz: Oh, totally. If not the theater, where have you been paged where you thought, “oh my gosh, this is just so not the time”?
Matthew Sanabria: This was my own fault, but I was at a bar with a couple of my co-workers. We were drinking, going out on a Friday night or whatever, and I got paged for a pretty severe incident. And I’m like, I’m three beers in… you’re all three beers in. There’s nobody here that’s not some number of beers in. But I have my laptop. So, you all want to fix this together? So I just took my laptop –
Paige Cruz: There’s wifi at the bar, right? Just get that password.
Matthew Sanabria: I was like, here. I put my laptop on the bar, we all gathered around, and we just had a good time fixing the incident. So that was a fun time.
Paige Cruz: We haven’t had a fun, happy, joyful incident yet on the channel. So I love that story. It is always better to troubleshoot together. I find the most anxiety came from when I was the only one paged and it maybe was before I had brought folks in and you’re feeling the weight of production on you. But it’s never a solo activity. It should not be.
Matthew Sanabria: Yeah, generally that incident was pretty straightforward to fix. There’s only been two incidents where I was like, “oh my God, this is enough already.”
One was an incident that I caused, where I deleted all the public DNS records for the company. As soon as I hit enter on that damn command, I was like, “this is gonna-“. I turned to my coworker next to me and said, “hey, this is going to be an outage. There’s going to be an outage. Can you get the channel ready?”
Paige Cruz: It’s happening.
Matthew Sanabria: Yeah. And I immediately started fixing it. But once I deleted the DNS records, there was a period of time for them to repropagate and whatnot. So it was like a 15-minute outage or whatever. But I was like, damn, I did this.
Paige Cruz: When it’s your keystroke that is the cause… at least you knew, it’s not like you hit enter and then went to go get coffee. You were there, you responded, but that must have been a moment of panic.
Matthew Sanabria: The other one was, I was on an 11-hour call with customers. And it was really low effort on my part, because they’d forgotten their encryption password for the application. Without that, the application won’t boot, cause the very first thing it tries to do is decrypt the necessary keys, and they forgot the encryption password. But they had an older password, so we were restoring old database backups in reverse chronological order to see if an older backup used the older password.
Each restore took 30-40 minutes, and we had to do that a number of times. Then, 11 hours later, I was like, “hey, person on the other side of the call, you said that you had old files with the old password in them. Can you just open up the trash bin on your MacBook real quick? I just want to see if there’s anything in there.” And sure as heck, there was a file in there with the correct password. We used it, the app booted up on the latest database backup, and everything was fine.
I was like, “11 hours of my time!!!”
Paige Cruz: That is a marathon and to be on a call with a customer where you’ve got to be professional. It’s a little bit different than when you’re just working with coworkers. And surprise twist, storing a password in plain text locally on your computer saves the day.
Matthew Sanabria: Yep. Congratulations.
Paige Cruz: That is not a lesson I think people should take away from that story. But that is quite interesting.
Matthew Sanabria: We have just influenced all our listeners to say cleartext passwords are okay. We failed. Never mind.
Paige Cruz: Yeah, shut it all down.
So you’ve handled some pretty hairy incidents in your time.
What advice do you have for on-call newbies, those that are new in their career? And then we’ll chat through when you’re joining a new company. What’s your advice for getting up to speed?
Matthew Sanabria: The tl;dr advice is don’t be afraid of the pager. It’s going to help you better understand the system and ask for shadow time, right? Don’t just go on like primary on-call without being ready. Ask for a shadow time where the page comes to you and the primary and you’re just like the shadow receiver. You’re not responsible for fixing it, but you’re responsible for syncing with the primary and like shadowing along.
So that’s my advice.
Paige Cruz: Very important!
Matthew Sanabria: I think too many people are afraid of the pager but it’s going to help you get better at your systems.
Paige Cruz: Absolutely. I think shadowing is so important because there’s a lot of things that don’t get documented in a company’s incident response process. Or sometimes your coworkers know the three different tools that logs go to, and you’ve only gotten access to one or something, that shadow week really shows you what you need to be able to actually participate in the response.
Matthew Sanabria: Agreed. And it takes time to read the playbooks or the runbooks, or to read the incident process. Even if you’ve already read those things, when a page is happening live and you have to respond to it, it takes time. And you don’t want to be nervous during that first time, so getting that under your belt is important.
Paige Cruz: What kind of on-call onboarding or I suppose incident response onboarding have you been a part of?
Have you tried the tabletop exercises where it’s just in Slack and it’s a fake incident or have you done the full blown game days? What’s been your experience?
Matthew Sanabria: We’ve generally done what we call a wheel of misfortune, where you have someone running a fake event saying, “okay, you get paged for this; ask me questions and I’ll give you answers.”
And you’re saying, “oh, I would use dashboard X to find out the CPU for Y.” And they say, “okay, the CPU is showing 99%…”, right? And they just go through that scenario with you so you can troubleshoot the issue. They are fun. They’re really good for thought experiments, because you can catch those weird little technicalities. “Oh, I would type kubectl get pods.” And it’s, you missed the -A, and now you only get pods in your default namespace, but your app runs over there, so you didn’t see it. It gets those little technicalities out, and I love that.
Paige Cruz: Totally. And was there one other onboarding example you had? Game day?
Matthew Sanabria: I haven’t had a team that was willing to break real production. I’m more than happy to. You want me to? Let’s go do it. So we broke non-production and treated that as the incident, and that was fun, but it was a little less helpful. It was helpful to go through the process, but less helpful for troubleshooting, cause prod is never the same as non-production.
So I almost wanted to just break real prod, but it’s hard getting people to agree to that.
Paige Cruz: That’s very true. The only time that I’ve been able to make that work was when I was doing greenfield development for a product that was in early, early beta with just like two customers where we had said, “hey, this is to help us harden the system. It’s going to be a better solution for you. So let us do this for a few hours.” So find whoever’s building the new stuff and go cozy up to that team.
Matthew Sanabria: Exactly.
Paige Cruz: Alright so we talked a bit about your life on-call, but as a nod to the podcast title, obviously your life isn’t full of computers and working and responding to pages.
What are your hobbies and interests when you are in that precious off call time?
Matthew Sanabria: You’re seeing some of it in the background here: motorcycles, cars, those things I do. Driving my car, riding my motorcycle, working on them when I have time.
Family time is super important, right? My wife and I spend a lot of time together, especially with our dogs and whatnot. And I am helping her open up her chocolate shop. So she’s a chocolatier and we’re now getting a physical location. So we’re doing that.
I do a lot of home improvement projects. I’ve renovated most of our house: whole kitchen, bathroom, all that stuff. So there’s always a next project.
Outside of that, just reading, writing, little tinkering with projects, working out, things of that nature. I have a good list.
Paige Cruz: That’s a really healthy balance – because some folks are like, no tech is my hobby and my job. But I think those of us who have been through some gnarly on-call rotations say, in my off time, please get me away from the computers.
I spin wool at a spinning wheel, like old Rumpelstiltskin. That is pretty low tech.
Matthew Sanabria: I’m in between on those, right? There are days where I’m like, I don’t want this computer. Please get rid of this computer. And there are other days where it’s, okay, done with work; I’ve been meaning to learn about this other thing, let me go do that. It’s a balance. When I’m in a job that’s burning me out a little bit, it tends to be the former, where I’m like, I don’t want to see this computer because I’m going to bury it in the backyard. But when I’m at a job where the work can be stressful but it’s a good stress, like you’re just busy, then I’m like, okay, let me switch contexts to this other thing I was interested in.
Paige Cruz: Yeah, it’s healthy to have those options to lean on because you don’t always control your work stress or the work environment.
Matthew Sanabria: If you find yourself taking the work stress home, that’s usually an indicator that something’s wrong and you should figure that out, right? Maybe you should leave that job, or maybe you should take time off, or whatever it may be. There’s a signal there.
Paige Cruz: Yeah, switch teams? That was something I always did when I felt like I was hitting a ceiling on my growth, and/or facing a lot of operational toil that didn’t look like it was going to have the path cleared for me to fix it. I said, “oh, I’ll switch teams. I’ll keep my health insurance, I’ll keep the context of the same company, but let me go do some work over there.”
Matthew Sanabria: No, I totally agree.
Paige Cruz: That’s a great point about bringing work stress home because even after an on-call shift, when you get to hand that over to the next primary, it doesn’t always mean your body or your mind says, “Oh, okay. I’m totally relaxed now.” What are your strategies for unwinding, whether it was a very stressful incident response or just a really rough on-call?
Matthew Sanabria: If I get a really bad on-call shift or something, I will not come into work the following day or two, right? I’ll just take time off automatically, and I’ve never had a manager push back against that. Some encourage it, some don’t, but they don’t prevent it either. I’m like, “listen, I was paged at 10:30 last night. I’m not coming in tomorrow. I don’t care what you say.”
So I do that. And then there’s also the cathartic getting together with some of your tech friends and just letting it out a little bit, right? Just being like, “I had to deal with this, and I was-“. You have to get that stuff out.
Paige Cruz: Little vent sesh.
Matthew Sanabria: And vent, yes. You have to do that. So I do that with some of my friends. I hop on Discord with them and say, “you’ll never believe what happened,” and we just have that.
I can’t really talk about it with my wife, because she’s about as far away from tech as possible. She’s very much about chocolate. She won’t understand what I’m talking about, so I go to my tech friends who do understand, and we have a good vent session.
Paige Cruz: Having that trusted circle is very important. That is what will get you through and help you have this career sustainably because there’s nowhere I know that does operations perfectly or has perfectly stable systems.
Matthew Sanabria: It’s healthy, right? You’re not getting together to talk smack about the company and this and that. You’re not saying, I hate this. You’re just saying you were frustrated with that, and it’s healthy to get that out. Because if you don’t, you’re just gonna take it to work with you the next time, take it out on your co-workers or whatever, and you’re just gonna hurt yourself.
Paige Cruz: Getting that out externally will help you be a lot more composed during the incident debrief or the incident review because that is not always the time to get into the drama, but it’s a time to really think through, okay, how did this incident affect me? What were the things that impeded my ability to do my best work here? And sometimes the emotions can overshadow that for me.
Matthew Sanabria: I had a pretty wild shift once, where the page itself was wild. There was a rogue employee that deleted data, and we got paged. It wasn’t on our side; it was on the customer side that their rogue employee deleted data and left files behind.
Paige Cruz: Malicious Insider!
Matthew Sanabria: Yep. So we were there trying to help them restore their system, and we’re going through the files to work on it. There were files titled “for HR”, “for Alex”, “screw you, Alex”, a bunch of files like that. Speaking of composure, you’re trying to remain composed, because you’re like, oh, I’m on a customer call. This sucks, and it’s serious for them. I can’t be laughing. As much as we’ve all wanted to just rip production down and walk out, as much as I can feel that, I don’t want to laugh, because this is serious. That happened three years ago. Fast forward to now: that person actually got convicted and did jail time for that.
So I was like, oh!
Paige Cruz: Good
Matthew Sanabria: For sure good.
Paige Cruz: There are not enough consequences for bad actors in tech, it feels like. I am not a fan of incarceration, but I am a fan of people facing consequences for their bad actions.
Matthew Sanabria: Oh, I was there, and thinking about it: had I been in a bad mood or something that day and said something weird, I could have been pulled into that crap, right? Cause I could have been summoned or whatever it may be. And I’m like, you know what? Maintaining your composure is very important so that you stay outside of all that stuff. Just an FYI.
Paige Cruz: Don’t want to be called for depositions or to give testimony or anything.
Matthew Sanabria: I do not want to do that.
Paige Cruz: Nope. Nope. And this is also why I always tell everyone, know how to mask sensitive info in your logs. Having worked at some monitoring vendors, it is not fun when folks accidentally send us their PII because not only is it our customer’s problem, it then becomes our problem. And there is so much paperwork. So just follow the laws.
Matthew Sanabria: There was a request during the discovery phase for that trial, a request that came in saying, “hey, do you still have logs from this session?” And we’re like, “no, we don’t; they delete after X period of time, blah, blah, blah.” So I’m good. I don’t want any of it.
Paige Cruz: We follow auditing and compliance rules. There’s a reason we have our SOC 2 certification. I love it.
Matthew Sanabria: Let me get out the script that I have to read you: please reach out to our friends.
Paige Cruz: Oh man. Wow. So we’ve covered a lot about on-call and off-call. Turning to your work with databases at Cockroach, I want to talk through the telemetry that data stores emit. What about databases should developers be totally familiar with? If they’ve been given access to multiple different tools and told to keep their systems up and reliable, where does the database fit in?
Matthew Sanabria: The database is definitely treated as more of an infrastructure concern at most places, right? Because the primary need of the database is that it has enough CPU, RAM, and disk to do its job. Slow disks lead to slow database writes, which leads to latency on the clients, and blah, blah, blah. An overloaded database can’t process requests, so it falls behind, and blah, blah, blah. So it’s going to be very infrastructure-y monitoring for the most part, especially when you’re using a database that you don’t write, an off-the-shelf database like Postgres or MySQL. That’s the first thing I’d recommend: make sure you understand the health of your database system at an infrastructure level.
Then from there you can make your alerts and whatnot. At Cockroach specifically, we’re also responsible for writing the database, right? We build CockroachDB, so there’s more app-level stuff, where you can emit more structured logs from the application, or do more tracing, or even control emitting more metrics. So we have more of that app-level stuff, but it’s a mix.
Paige Cruz: Know your infrastructure. That’s, I think, an evergreen tip. Whether your database is running in Kubernetes, in the cloud, or on-prem, it’s running somewhere, and you should understand where that somewhere is and what it means.
Matthew Sanabria: Yeah. To look at CockroachDB: it emits over 1,500 different metrics.
Paige Cruz: Whoa!
Matthew Sanabria: I’m not looking at 1,500 metrics, right? Let’s be real. I’m not. And different teams that work on different parts of the app care about different metrics. But for me, there’s a core set of metrics that I care about out of those 1,500. Those, plus the infrastructure metrics, plus shipping your logs somewhere so you can investigate with them, is going to get you 90-something percent of the way there. That’s the tl;dr, because most databases won’t give you access to change their code, right? You’re just using off-the-shelf systems.
Paige Cruz: Yeah, and 1,500 metrics is a lot. I try to tell folks auto-instrumentation is a blessing, or sort of the out-of-the-box whatever, provided by whoever wrote the database or the component you’re using. However! If you don’t actually go through and figure out what to drop and what to keep, you’re going to end up paying for a lot of data you don’t look at and probably don’t need.
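(As a sketch of what that dropping can look like in practice, assuming you collect metrics with the OpenTelemetry Python SDK: a View with a drop aggregation discards an instrument you’ve decided you don’t need. The instrument name here is hypothetical.)

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import DropAggregation, View

# Hypothetical noisy instrument that nobody charts or alerts on.
provider = MeterProvider(
    views=[View(instrument_name="app.cache.lookup_latency", aggregation=DropAggregation())]
)
metrics.set_meter_provider(provider)  # matching metrics are now dropped in the SDK
```

Vendors and collectors offer equivalent drop rules server-side; either way, the point is to stop paying for telemetry nobody reads.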
Matthew Sanabria: Exactly. The database will give you some good signals, right? It’ll tell you the storage size of databases, it’ll tell you what tables you have, and it’ll give you some metrics you can query, either through a SQL interface or maybe an endpoint you can hit. But you can write your own apps too. I try to tell people: when you’re working with a database, write a small app that probes it, your own health checker, so to speak. Do a round trip to the database every now and then, measure that latency, and get your own telemetry. That’s more representative of your workload.
You can’t always rely on the database’s own stuff, right? You have to see it from the app’s perspective too.
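(A minimal sketch of that kind of prober, assuming a Postgres-compatible database and the psycopg2 driver; the DSN and probe interval are placeholders.)

```python
import time

import psycopg2  # pip install psycopg2-binary

DSN = "postgresql://app_user:secret@db.example.internal:5432/app"  # placeholder

def probe(dsn: str) -> float:
    """Connect, round-trip a trivial query, and return latency in seconds."""
    start = time.monotonic()
    conn = psycopg2.connect(dsn, connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
    finally:
        conn.close()
    return time.monotonic() - start

if __name__ == "__main__":
    while True:
        try:
            # Connect time plus query time, as seen from the app's side of the network.
            print(f"db round trip: {probe(DSN) * 1000:.1f} ms")
        except psycopg2.OperationalError as exc:
            print(f"db probe failed: {exc}")
        time.sleep(30)  # ship these numbers to your metrics system instead of printing
```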
Paige Cruz: Oh, totally. Your point about databases primarily being an infrastructure concern really hits home for me. So what can app developers do to make life easier for the folks running database infrastructure? What could they do for us? Or for you? I haven’t been running databases, I’m off-call!
Matthew Sanabria: Pun intended. That’s a good question. I think app developers just need to be aware of how many connections their app is using to connect to the database. There’s two things. One: don’t just arbitrarily add more connections to the database, right? Don’t assume that, oh, I can just keep adding more connections and everything will be fine.
Paige Cruz: Connections are cheap.
Matthew Sanabria: Yeah. Don’t do that. Have some sort of pooling logic in your app that brokers connections around, blah, blah, blah. Implement that if you can. And two: optimize your queries, right? As much as possible. Yes, that requires measuring your queries and how long certain things take, but modern database drivers have the ability to add telemetry and tracing. Do that, measure what you’re doing, and optimize the heavy queries. Not only will that make your app respond better, because the query comes back to you faster, but it makes the database do less work, right?
Less loading into memory, less disk shuffling, all that stuff. So you’re not contributing to poor system performance if you’re just on top of your queries, right?
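(A sketch of both ideas, again assuming psycopg2 against a Postgres-compatible database: a bounded connection pool instead of ad-hoc connections, plus a timing wrapper around queries. The pool limits and DSN are illustrative.)

```python
import logging
import time

from psycopg2 import pool  # pip install psycopg2-binary

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("db")

# A bounded pool: the app brokers a fixed set of connections rather than
# piling new ones onto the database under load.
db_pool = pool.SimpleConnectionPool(
    2, 10,  # minconn, maxconn: size these to your workload
    dsn="postgresql://app_user:secret@db.example.internal:5432/app",  # placeholder
)

def timed_query(sql, params=None):
    """Run a query on a pooled connection and log how long it took."""
    conn = db_pool.getconn()
    try:
        with conn.cursor() as cur:
            start = time.monotonic()
            cur.execute(sql, params)
            rows = cur.fetchall()
        log.info("query took %.1f ms: %s", (time.monotonic() - start) * 1000, sql)
        return rows
    finally:
        db_pool.putconn(conn)  # always hand the connection back to the pool
```

In practice you’d likely lean on your driver’s or framework’s built-in pooling and OpenTelemetry instrumentation rather than hand-rolling this, but the shape is the same.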
Paige Cruz: I love that. So app developers, take a look at these things; go do some deep dives into what your application is doing. On the infrastructure side, what do you think infra folks could do to make things easier for app developers? What’s our role in this?
Matthew Sanabria: You can get pretty far on a single database, honestly. You really can. Most companies can get really far on a beefy single database backed by good SSD storage.
If you exhaust that, then you should be looking at maybe doing a read replica, right? To offload your reads to the replica, if you’re okay with reads being slightly out of date. You could also do some load balancing, or some sort of distributed SQL. This is where CockroachDB is really good, because it runs as a distributed SQL database already, so you already have this concept of multiple nodes, and you essentially get sharding for free there, for the most part. So that’s really something you can do.
Otherwise, outside of that, they make things you can put in front of your database that broker connections and cache things a little bit. See about implementing that. But you can get really far just optimizing your database, like the settings.
Oh, that’s another thing I didn’t talk about: database settings. Don’t just use the database as it comes out of the box. The settings are there; read up on what they do and change them. That’ll help you.
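(One hedged example of “read up on what they do”: a Postgres-compatible database will tell you which settings have been changed from their defaults through the pg_settings view, which makes a decent starting inventory before you tune anything. The DSN is a placeholder.)

```python
import psycopg2  # pip install psycopg2-binary

DSN = "postgresql://app_user:secret@db.example.internal:5432/app"  # placeholder

conn = psycopg2.connect(DSN)
with conn.cursor() as cur:
    # pg_settings lists every knob; 'source' records where the current value
    # came from, so filtering out 'default' shows only what has been tuned.
    cur.execute(
        "SELECT name, setting, boot_val, source "
        "FROM pg_settings WHERE source <> 'default' ORDER BY name"
    )
    for name, setting, boot_val, source in cur.fetchall():
        print(f"{name} = {setting} (default: {boot_val}, set via: {source})")
conn.close()
```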
Paige Cruz: So both sides. When ops and infra show up to do their part, and app developers too, that is where you get magical lightning-fast queries, beautiful user experiences, happy developers, happy operators.
Matthew Sanabria: And it’s not even like you need a PhD in this to do it, either. We read these articles from people that say things like “how I reduced our database load by 30 percent by blah, blah, blah,” and you read through the article and all they did was change some settings. And you’re like, “wow, how did they know that?” They read the settings, understood what they did, turned them on, and measured the performance. You can do that too. But I think our industry as a whole, not even just databases, the whole industry, has a pretty bad habit of using tools out of the box and doing zero configuration or customization. Terraform was like this too.
When I was at- people were like, “why is Terraform so slow?” Cause you’re using the default parallelism of 10. You can bump that up with the -parallelism flag; you can configure the thing. I think more people should spend time configuring those things, and you’ll see, “oh wow, I can get some good performance out of this.”
Paige Cruz: Totally. I think a lot about performance in the observability space, with the agents or auto-instrumentation or collectors. I think what gets folks is they get it out of the box, and it’s so much work to get it up and running and deployed through all their processes, that if it’s good enough they never go back. It’s, okay, I’m so tired from getting it set up, and I need to run to three other projects. But really, if you’ve introduced something new, or you’ve had a database running for a while, once it’s been used and you have actual usage and traffic patterns, that’s a great time to go optimize.
If you’re introducing something new, set a little timer, set a little reminder for two months down the road to come back and look at how you could make it better.
Matthew Sanabria: Correct. Oh, one last thing app developers can do: emit more information about what you’re doing with the database, right? Don’t just say “connection failed to the database” or “query error.” Give me your fingerprint for the query, right? Give me the transaction ID you had. There’s more information you can emit when these things go wrong, so you can tie the application-level error to the database-level error. Don’t just say, “oh, I failed the connection at 3:22 PM, go figure it out.” Tell me what database you’re trying to connect to; give me more information.
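(A sketch of what “give me more information” might look like in application code. The field names, the crude fingerprint, and the helper itself are illustrative, not a standard.)

```python
import hashlib
import json
import logging

log = logging.getLogger("app.db")

def log_query_failure(exc, sql, txn_id, db_host, db_name):
    """Emit a structured error that can be correlated with database-side logs."""
    # Crude stand-in for a query fingerprint; real fingerprinting normalizes
    # out literal values so identical query shapes hash the same.
    fingerprint = hashlib.sha256(sql.encode()).hexdigest()[:12]
    log.error(json.dumps({
        "event": "db_query_failed",
        "error": str(exc),
        "query_fingerprint": fingerprint,  # ties the app error to the query shape
        "transaction_id": txn_id,          # ties it to the database transaction
        "db_host": db_host,                # which database, not just "the database"
        "db_name": db_name,
    }))
```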
Paige Cruz: Give me the info and give it up front, because the thing is, you’ll need that information to figure it out, and you could get it from the get go, or you could go on a wild goose chase through multiple repos, and looking at config files and it’s not fun to do that. Like you, I would rather have that information at my fingertips.
Matthew Sanabria: Exactly.
Paige Cruz: So thinking through your span of running databases and operating them and monitoring them, has database monitoring gotten better over time? Is it better today than five, ten years ago?
Matthew Sanabria: I would say it’s better, just because technology has evolved: UIs are nicer, logs have probably been updated to be more informative, things of that nature. In general it’s not like there’s been a change that made database monitoring better; it’s still the same as it was back in the day. It’s just the underlying hardware has changed, and the workloads hitting the database have changed, so there’s more information to capture now than there was back in the day. And I think a lot of engineers across the stack, software, ops, whatever, believe they have to capture all this information to be successful in monitoring their database. And it’s, no, you don’t have to capture everything. You just have to capture everything that matters. If you find things that don’t matter, don’t capture them. That’s really what it is. If you’re an app developer and you’re emitting something that’s unused, not useful, stop emitting it.
Paige Cruz: That’s money and storage resources. It’s not free. It is not free or cheap to monitor today’s systems.
Matthew Sanabria: We’re so afraid as an industry to change the systems that we built.
Paige Cruz: Or inherited, to be fair.
Matthew Sanabria: It’s like that log line: somebody added it to fix an issue, or to get more information about an issue. The issue’s been resolved. We don’t need that log line anymore. Get rid of it. It’s just noise.
Paige Cruz: Same with your bad alerts.
Matthew Sanabria: Yeah, and your bad alerts. That’s a huge thing. Get rid of your bad alerts, your bad telemetry data. There are companies out there that say, “oh, we allow you to capture everything,” and that’s great. You should be able to capture everything, in a sense. But don’t use that as an excuse to capture everything and never eliminate the noise, never get rid of that stuff.
Paige Cruz: It can become a real tragedy of the commons. I’d say if you are on-call and you hold a pager, you have the authority and the responsibility to make that experience great. That could be deleting crappy monitors. That could be dropping tags or labels you just don’t need, or whole metrics or log lines. That power of the pager, use it as leverage. Your manager and your VP are most likely not on-call; you are. So you’ve got to own that part of it.
Matthew Sanabria: There’s two things about this, too, that kind of hurt here. One, there’s a lot of people that, when they see that it’s a database, give you the excuse: “it’s a database. We shouldn’t touch it. It has to be stable.” There’s a good metaphor that I like to tell people.
Paige Cruz: I love a metaphor.
Matthew Sanabria: Stability is not “not using something,” right? Think about architecture in the world: you go to these old buildings that have been standing for millennia. We would argue that they are stable, but you’re actively using them today. You’re walking on them, you’re taking pictures in front of them. Yet we look at the database and try to say, “oh, it needs to be stable, so don’t touch it.” Something can be stable and very well used, right? That’s the metaphor I like to give people, the architecture one. That concrete thing they built a thousand years ago? It’s stable, and it’s very much in use today. Why are we treating our database as if it’s some fragile thing that’s going to fall over with one connection? Come on, that’s not stable.
That’s one thing I try to tell people: we’ve got to stop using that excuse of “it’s a database, don’t touch it.” If you’re afraid to touch it, you don’t understand it. And if you don’t understand it, why is your business relying on it?
Paige Cruz: And our last question, if you’ve got time. Kubernetes came and caused quite the kerfuffle in the industry. How has the rise of running a kajillion containerized microservices in Kubernetes impacted working with databases and operating them?
Matthew Sanabria: That’s a good question. There’s two sides to this. One is your app running on Kubernetes, across many containers in some sort of microservice architecture. Now the same user request has to traverse maybe N number of containers. You’re making these network hops across containers to serve the request, and it’s harder to observe that. When a request has to traverse the network, it’s harder to observe what’s happening, and this is what gave rise to distributed tracing and blah, blah, blah. It’s difficult because it’s hard to wrap your head around: “okay, this request came into this function in this codebase, then it made a network request to that function in that codebase. Now I have to pull up that codebase.” It’s hard to just observe it. That has impacted anything that has to do with Kubernetes.
On the database side specifically, if you’re going to run your database on Kubernetes, that means it’s already running a layer above the system, right? It’s already running on a VM and inside a container. So there’s already a layer of abstraction between you and the underlying hardware. No matter how good hypervisors and container runtimes are, there’s always going to be latency introduced with that, right? So you have to understand that.
More importantly, Kubernetes usually provides disks to your containers using network-attached storage. If you’re going to attach a network disk to your database and use that as your datastore, now reads and writes to and from it have to go through those layers of abstraction. Again, more latency. So keep that in mind when you’re working with Kubernetes and containerized applications: the more layers of abstraction you have, the more latency is introduced in those layers, and the more places there are for incidents to occur.
Paige Cruz: The complexity has increased, yeah. I’m still like, did Kubernetes make things better for the world? Jury’s out, and that’s probably too unfair a question to pose. But it has definitely made it trickier to figure out what the heck is going on and where things are, getting called to and fro.
Paige Cruz: So with that, thank you so much for joining me today. In the show notes you can find Matthew’s blog and, of course, links to Cockroach Labs, if you want to check out all the great databases Matthew is on-call for and make use of his services.
Matthew Sanabria: Y’all should definitely check it out.
I do some vlogging, some YouTube. And yeah, if you want to check out CockroachDB, feel free to. It is a Postgres-compatible database, so if you’re a Postgres user, you’ll feel right at home with Cockroach.
Paige Cruz: Check it out. It’s got the most interesting tech mascot I have come across. Win for differentiating.
Matthew Sanabria: Yeah, I should have led with that.
Paige Cruz: And that concludes today’s episode. Thanks for tuning in. I hope you had as much fun listening as we did recording. From exciting tales drawing on Matthew’s 11 years (and counting) on-call, to advice for de-stressing after a rough incident, and practical tips for both dev and ops folks to level up database operations and observability, this episode was jam-packed with goodness.
Do yourself a favor and check the show notes for links to Matthew’s blog, his YouTube channel at SudoMateo featuring a current series all about Go, resources to learn more about CockroachDB and CockroachLabs, and if you happen to be in the New York area, the Platform Engineering New York Meetup Group.
Finally, a big thank you to our sponsor, Chronosphere. We’re not like other observability vendors. Seriously. You won’t hear us advising you to “just send us everything,” because we understand that not all telemetry is equally valuable, and we don’t profit from hoarding your telemetry. That’s why we designed and built Chronosphere with control in mind.
From controlling metrics with powerful aggregations, drop rules, and cardinality-cutting measures, to controlling traces with intelligent and approachable sampling using baselines, to controlling logs with the power of processing and filtering within telemetry pipelines.
Chronosphere is here to help you dial in the signal-to-noise ratio within your observability data. To see this awesome control plane in action, check out the walkthroughs on chronosphere.io.
Until next time!