Podcast: Ops Teams are Pets, Not Cattle

On this episode of the Stack Overflow Podcast, sponsored by Chronosphere, hear from Senior Developer Advocate Paige Cruz on how teams can reduce cognitive load on operations, the best ways to prepare for failures, and more.

Featuring

Paige Cruz

Senior Developer Advocate
Chronosphere

Ryan Donovan

The Overflow

Ben Popper

Director of Content
The Overflow

Transcript:

Ben Popper: (00:09)

Hello everybody. Welcome back to the Stack Overflow podcast, a place to talk about all things software and technology. I am Ben Popper, director of content here at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Donovan.

Ryan Donovan: (00:25)

Hi, Ben. How you doing today?

Ben Popper: (00:26)

I’m doing good. We’re going to be talking about one of your favorite metaphors today: Pets versus cattle. It’s come up on the blog. I’ve seen it in the newsletter. It’s your trusty go-to. Today’s episode is sponsored by the fine folks over at Chronosphere.

Ryan, what are we going to be chatting a little bit about today, before I introduce our guest?

Ryan Donovan: (00:44)

You know, the typical pets versus cattle metaphor is that you want your servers to be cattle – sort of disposable and easy to process – instead of pets, where you name them and love them. We’re going to be talking about the ops folks, and you want them to be pets, not cattle. They are not disposable.

Ben Popper: (01:01)

Neither your servers nor your people are disposable. I got it – makes sense. So, joining us today from Chronosphere is Paige Cruz, who’s a Senior Developer Advocate over there. Paige, welcome to the Stack Overflow podcast.

Paige Cruz: (01:14)

Thank you for having me. Longtime lurker on the Stack Overflow forums and first time podcast guest. Very delighted to be here today.

Ben Popper: (01:23)

Good. I hope you’ve copied and pasted your fair share. Everybody deserves to lurk.

Tell us a little bit about yourself. How’d you get into the world of technology and how’d you end up at the role you’re at today?

Paige Cruz: (01:32)

I had a bit of a roundabout way of getting into technology. In school, I studied a blend of mechanical engineering and business, because I’m very hands-on. I don’t like abstractions – ironically, because I ended up working in cloud computing, where I press a button, and all of the sudden lots of things happen under the hood and we’re built on top of so many layers of abstraction. At my first software company I was recruiting and managing our interns. And I thought: “Hey, we’re paying these interns a lot of money. I passed my Python class in college. I could do this coding thing. Maybe I’ll give engineering another go.”

And so, that led me on a journey to getting on my first team, where most of the company was on-prem and we were given an AWS account and told: “Hey, our expensive servers are not going to be powering the demo.” My team managed the demo for that company. And so, we were handed this pile of Terraform and Ansible and EC2, and had to make a disposable demo environment that could spin up in 10 minutes and spin down pretty quickly. And that really led me to this “pets versus cattle” idea. I ran the Jenkins box. It was kind of my entry point to this whole DevOps SRE (site reliability engineering) infrastructure world. And I just kept asking: “How does this work? What’s under the hood?” until I felt like I’d reached the level that I wanted to stay at.

Ryan Donovan: (02:55)

So, let’s talk about that ops work. A lot of ops is [about] dealing with code in production, and especially when fires happen. I’ve gone out with friends to the bars with a laptop – in case anything broke. First of all, what is the worst place you’ve gotten paged? And second, how can we make those pages a little less stressful?

Paige Cruz: (03:19)

I love that question. So I’ll answer for myself. My husband is on the pager as well, so we were a two-pager household for a while.

But the worst place for me was when the Oregon Symphony was playing along with the Harry Potter movie, doing the soundtrack and score, and PagerDuty bypassed silent mode. That thing started blaring, and I could not turn it off. I’d spent a decent chunk of money to be near the front, so I had a very long, embarrassing skip up the hallway to get out of that symphony hall. Worst place by far. I’m sure the listeners want to know too, but just for me: what is the worst place you’ve been paged, Ryan?

Ryan Donovan: (04:01)

I’ve never been on PagerDuty, thankfully.

Paige Cruz: (04:04)

Oh my goodness. I can’t even imagine my life without PagerDuty. But you asked a great follow-up, which is: yes, we’re going to acknowledge that this on-call world is stressful, but how do we make getting paged a little bit less stressful? For me, it comes down to a few different buckets. One is making it a familiar process to ack an alert, or even to declare an incident. You do not want the first time your engineers declare an incident to also be the first time they’re looking at the docs and working out what severity it is. When it’s unfamiliar, that heightens the stress – because you don’t know if you’re doing the right things, and it’s this charged environment.

So, I like to recommend that folks look at their internal dependencies – whether that’s your CI/CD build system or GitHub, or any of the dependencies you have from PR to production – and treat those dependencies going down as an incident. Run it with status page updates just to your dev team. Have an internal process and just get people used to that motion of declaring the incident, coordinating, and knowing where to look for information, so that when it comes to the big production incident, they’re like: “I know where this button is. I know how to spin up the template.”
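As a rough illustration of the practice Paige describes here, treating an internal dependency outage as a low-stakes incident with status updates posted only to your own dev team, here is a minimal Python sketch. The severity scheme, field names, and webhook URL are hypothetical placeholders, not a real internal API.

```python
# A minimal sketch of an internal "declare an incident" helper, for practicing
# the incident motion on low-stakes internal outages (CI down, GitHub degraded).
# The severity scheme and field names are illustrative assumptions, and the
# webhook URL is a placeholder; a real version would POST to your chat tool.
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

INTERNAL_STATUS_WEBHOOK = "https://example.internal/hooks/dev-status"  # placeholder

@dataclass
class InternalIncident:
    title: str       # e.g. "CI builds failing on main"
    severity: str    # e.g. "sev3" = internal-only impact
    commander: str   # whoever declared it coordinates until handed off
    opened_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    updates: list[str] = field(default_factory=list)

    def post_update(self, message: str) -> None:
        """Record a status update and show the payload an internal status
        channel or webhook would receive."""
        self.updates.append(message)
        payload = {"text": f"[{self.severity}] {self.title}: {message}"}
        # In a real setup this would be an HTTP POST to INTERNAL_STATUS_WEBHOOK.
        print(json.dumps(payload))

# Practice the same motions you would use for a production incident.
incident = InternalIncident(
    title="GitHub Actions runners backed up",
    severity="sev3",
    commander="on-call dev",
)
incident.post_update("Declared. Investigating runner queue depth.")
incident.post_update("Mitigated by scaling the runner pool; monitoring.")
```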

Ben Popper: (05:30)

Run through the plays in the playbook before you get to the runbook. Yeah, for sure.

Paige Cruz: (05:34)

Exactly. And then, the other thing, I think, is putting it into practice, which is training. I’ve seen a variety of training, from tabletop scenarios: what would happen if we failed to rotate the certs? How would we respond? I honestly used to kind of shrug off tabletop exercises. I thought that if it’s not in production, it doesn’t count. But what I realized is that there are a few different layers of skills and knowledge. And even just saying, “This is how I would respond as a staff engineer versus a junior engineer,” is a really helpful conversation to share knowledge and say: “That’s not the dashboard I would go to,” or: “Have you heard of this tool called netstat?” So, the tabletop is sort of your easiest entry point. A mock incident in production would be great.

Paige Cruz: (06:23)

But a lot of folks use staging, where you take out a dependency, inject some latency, some jitter, whatever it is, to actually trigger your alerts and go through that end-to-end process. And then finally, the last piece, which sits outside the actual training of incident response, is getting familiar with your monitoring and observability telemetry data. That is the area least often covered in engineering onboarding. It’s the stuff that isn’t covered in college and boot camps, and it’s the stuff you’ve got to learn on the job. And I would rather have my engineers learning in calm, stress-free scenarios, where they’re just discovering the metrics available, or saying: “Oh, cool. We have traces that are 60% complete. How can I use that?” versus getting paged and furiously trying to figure that out on the fly. So, if you hear a theme here, it’s all about proactivity in my book.
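For the staging fault-injection step Paige mentions, a minimal sketch in Python might look like the following. The environment variable name, delay values, and error rate are assumptions for illustration; the point is simply to add latency and jitter to a dependency call so your alerts fire end to end.

```python
# A minimal sketch of injecting latency/jitter into a dependency call in a
# staging environment, to exercise alerts end to end. The FAULT_INJECTION
# variable, delays, and error rate are illustrative assumptions.
import os
import random
import time
from functools import wraps

def inject_faults(base_delay_s=0.5, jitter_s=1.5, error_rate=0.1):
    """Wrap a dependency call with artificial latency and occasional errors."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if os.getenv("FAULT_INJECTION") == "enabled":  # staging only
                time.sleep(base_delay_s + random.uniform(0, jitter_s))
                if random.random() < error_rate:
                    raise ConnectionError("injected fault: dependency unavailable")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(base_delay_s=0.5, jitter_s=2.0, error_rate=0.2)
def fetch_user_profile(user_id: str) -> dict:
    # Stand-in for a real downstream call (database, internal service, etc.).
    return {"user_id": user_id, "plan": "free"}

if __name__ == "__main__":
    os.environ["FAULT_INJECTION"] = "enabled"
    for attempt in range(5):
        try:
            print(fetch_user_profile("u-123"))
        except ConnectionError as exc:
            print(f"attempt {attempt}: {exc}")  # should trip latency/error alerts
```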

Ben Popper: (07:19)

That’s awesome. I feel like my experience at Stack Overflow is that, if you could make this into a competitive role playing game where all the engineers could keep track of their scores, then they would be practicing constantly – whether they’re working or not.

Paige Cruz: (07:34)

Yeah. I try to tell people: “I work at an observability company. When I open up any observability platform, I see it as this treasure chest of information.” And even if I know the instrumentation is not there, even if I know there are some things I have to work around with labels or cardinality, I love just digging in there to see. That’s how I self-onboard. So, I try to tell folks: “Hey, even if you don’t love the data you see in there, that’s your inspiration to instrument more. Add a label, keep it going.” Because, at the end of the day, we all depend on that data flowing through – whether it’s the infra team versus the app team versus eng, all of the data is related. All of the dependencies are linked together in some way.

Ryan Donovan: (08:20)

But ops isn’t just firefighting. There’s the regular monitoring of code in production, and observability helps with that. But what are other ways we can make that monitoring a little bit easier?

Paige Cruz: (08:36)

So, I’ve spent my career in monitoring and observability. That is the bazillion-dollar question, right? How do we make this stuff easier? The challenges that I see are around where knowledge is shared, especially in a distributed remote environment or a hybrid environment. An anti-pattern I see is people becoming very familiar with Slack’s query language. There’s sort of this split between people who can operate really well with instant messaging, and people like me who really love forums. I wish we had the mailing list back again, because that is the pace I want to go through information at, without being pinged all the time. So, for me, making this monitoring easier starts with how your engineers know where to go. Figure out what the standards are for your org. What does a well-formed alert look like? Is there some internal instrumentation library that will decorate every piece of telemetry with the team owner, and maybe even the service tier if you’re that fancy?
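The internal instrumentation library Paige describes could be as simple as a thin wrapper that stamps every metric with its team owner and service tier. The sketch below uses the Python prometheus_client library; the label names and the team and tier values are assumptions for illustration, not an established standard.

```python
# A minimal sketch of an "internal instrumentation library": a thin wrapper
# that stamps every metric with a team owner and service tier so each alert
# already says who it belongs to. TEAM and TIER are hypothetical values.
from prometheus_client import Counter, Histogram

TEAM = "checkout"   # hypothetical owning team
TIER = "tier-1"     # hypothetical service tier

_STANDARD_LABELS = ("team", "service_tier")

def owned_counter(name: str, description: str, extra_labels: tuple = ()):
    """Create a counter that always carries the standard ownership labels."""
    counter = Counter(name, description, _STANDARD_LABELS + extra_labels)
    # Pre-bind the ownership labels so call sites only supply their own.
    return lambda **labels: counter.labels(team=TEAM, service_tier=TIER, **labels)

def owned_histogram(name: str, description: str, extra_labels: tuple = ()):
    """Same idea for latency histograms."""
    hist = Histogram(name, description, _STANDARD_LABELS + extra_labels)
    return lambda **labels: hist.labels(team=TEAM, service_tier=TIER, **labels)

# Usage at a call site: the team/tier labels come along for free.
checkout_errors = owned_counter("checkout_errors_total", "Checkout failures", ("reason",))
checkout_latency = owned_histogram("checkout_latency_seconds", "Checkout latency")

checkout_errors(reason="card_declined").inc()
checkout_latency().observe(0.42)
```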

It’s also about knowing the scope of what you’ve got available, and where the line of responsibility is between your observability team, slash your DevOps team, slash your SRE team, and now the platform team (ops rebranded), versus developers: where is that line for your organization? So, how do we make it easier? First, knowing who to go to, when, and for what. The other thing, for me, is having a durable knowledge store. I’ve seen it with a wiki, with Confluence, wherever you’re keeping this information. But get it out of that instant-message stream, where time is constantly moving forward. You’ve got to be able to ask those questions and learn in a calmer setting.

Ben Popper: (10:30)

We do spend a lot of time here thinking about: “How do you keep the institutional knowledge fresh?” How do you make it easier for people to ask questions and not feel intimidated, and for anybody who might be on a particular team, or maybe on another team but happens to be familiar with that technology, to come in and give a good answer?

So, it’s definitely something that Ryan and I spend a lot of time thinking about. It’s interesting you bring it up: it’s a product, but it flows from the thesis of the public site, which is crowdsourcing knowledge and breaking down silos to make sure everybody feels prepared, and, in a situation where you’re stuck or you don’t have the knowledge you need, to be able to get to it as fast as possible.

Paige Cruz: (11:13)

Oh, totally. I sit in marketing now; I have given up the pager. I was treated as cattle for far too long, and I said: “You know what? I’m going to go into marketing.” But I was given access to our Stack Overflow for Teams, and it really came in handy. For one, I just like to creep and learn. I haven’t given up my engineer heart – I still want to know how we set things up, and do all the things. But when I was approaching learning about our on-call, to write a post about making our on-call holidays a little bit more equitable (because my husband was on-call for Christmas three years in a row and I felt that pain), I said: “Never again, not on my watch. We’re going to distribute this work.” But I had to ask in Stack Overflow for Teams: “Hey, how do we treat holidays on-call? Is it volunteer? Do we pay extra?” And within a day, I got a really great answer from one of our developers, and other people who came after me were able to benefit from that information because of the way it was presented, versus it getting lost in a sea of IMs and lost to the ether. So, yeah. I’m a big fan.

Ryan Donovan: (12:19)

Yeah. When that incident happens, you don’t want to be chasing a bunch of people down at two in the morning, or whatever time the incident happens.

Paige Cruz: (12:27)

Yeah. And there’s a great quote from “Seeking SRE” that I’ll just read, that really ties it together with a bow – at least from the SRE perspective: “Your job is to make knowledge, context, and history available to engineers who are making decisions daily – while working on features without your oversight. Building a knowledge base of design documents creates the structure necessary to build context and history around the architecture.” And so, for me, that’s really where it’s at: those connections, who depends on what, when was the last time this alert fired, and did somebody do something about it? Did they ack it? Was it actionable?

I would pay PagerDuty so much to just have a Tinder swipe for “actionable, not actionable” at the end of your shift.
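That end-of-shift review doesn’t need a product feature; a quick script over an alert export can approximate the swipe. In the Python sketch below, the alerts.json format is a hypothetical export, not a real PagerDuty API.

```python
# A minimal sketch of the "swipe on your alerts" idea: at the end of a shift,
# walk through the alerts that fired and record whether each was actionable.
# The alerts.json export format is hypothetical; a real setup would pull this
# data from your paging or alerting tool.
import json
from collections import Counter

def review_shift(path: str = "alerts.json") -> None:
    with open(path) as f:
        alerts = json.load(f)  # e.g. [{"name": "HighLatency", "fired_at": "..."}]

    verdicts = Counter()
    for alert in alerts:
        answer = input(f"{alert['name']} ({alert['fired_at']}) actionable? [y/n] ")
        verdict = "actionable" if answer.strip().lower().startswith("y") else "noise"
        verdicts[verdict] += 1
        alert["verdict"] = verdict

    with open(path, "w") as f:
        json.dump(alerts, f, indent=2)  # keep verdicts for the next alert-tuning pass

    print(f"Shift review: {verdicts['actionable']} actionable, {verdicts['noise']} noise")

if __name__ == "__main__":
    review_shift()
```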

Ben Popper: (13:16)

It’s interesting – you talked about managing some of the more human, or HR, elements to help out your partner who was on PagerDuty. There is a bit of cultural learning that can be helpful here too, along with the technical side of it.

As you’ve learned over the years, and specifically in your role now, what are some tips you have for managing burnout, best practices that orgs can apply that will keep their ops people happy and successful?

Let’s discuss that first, and then maybe transition a little bit into what you think from Chronosphere’s perspective: how is your tooling built to accommodate that, or to accomplish that?

Paige Cruz: (13:55)

Oh, burnout. This was a hot topic when I started my career about six or seven years ago, and we still haven’t solved it. We still have to talk about it. As for me, I mentioned that I burned out last year, right around this time. And I just said: “I don’t think it’s sustainable for me, in my life, to carry this pager and have a small on-call rotation with a really big range of responsibilities.” The ratio of people on call to services we owned was out of whack. And I think my recommendation is to always start by talking to your manager. You have to be open about this. You have to be open when you feel like you’re starting to burn out, because there are lots of things your manager or the business can do to help you. They can put you on a different project; they could give you a couple of weeks off. You have to explore those early interventions before it gets to the point where you’re saying: “I’ve just gotta quit. I have to go be an alpaca farmer. That’s my thing.” You want to catch it as early as possible, and treat it very seriously. A lot of my advice is for management, because I would like to see more people intervene. So, look at the last year of stats: how often was a particular member of your team on call? How often were they interrupted after hours or on the weekends? How is this bleeding into their personal life?

Those are pretty big stats. There’s also sometimes switching teams and saying: “Hey, I’ve been on the internal tools team for a really long time. Doing help desk via Slack is just totally killing my focus. Can I go be on a backend team for a couple of quarters?” Some sort of rotation – giving people that empathy for what it’s like to be on the inside of an internal tools team.
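The after-hours stats Paige suggests managers review are easy to compute once you have a page log. A minimal sketch, assuming a hypothetical export of (responder, timestamp) pairs and a 9:00 to 18:00, Monday-to-Friday business-hours window:

```python
# A minimal sketch of after-hours on-call stats: given pages with a responder
# and a timestamp, count how many landed outside business hours or on weekends.
# The pages list and the business-hours window are illustrative assumptions.
from collections import defaultdict
from datetime import datetime

pages = [  # hypothetical export: (responder, timestamp)
    ("alex", "2023-06-03T02:14:00"),   # Saturday, middle of the night
    ("alex", "2023-06-05T10:30:00"),   # Monday, business hours
    ("priya", "2023-06-06T22:05:00"),  # Tuesday, after hours
]

def is_after_hours(ts: datetime) -> bool:
    weekend = ts.weekday() >= 5            # Saturday=5, Sunday=6
    off_hours = ts.hour < 9 or ts.hour >= 18
    return weekend or off_hours

counts = defaultdict(lambda: {"total": 0, "after_hours": 0})
for responder, raw_ts in pages:
    ts = datetime.fromisoformat(raw_ts)
    counts[responder]["total"] += 1
    if is_after_hours(ts):
        counts[responder]["after_hours"] += 1

for responder, stats in counts.items():
    print(f"{responder}: {stats['after_hours']}/{stats['total']} pages after hours")
```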

Ben Popper: (15:45)

It sounds like you’re serving on the front lines of a military operation here. Everybody knows you gotta rotate to the back every once in a while. Recover yourself. You don’t want to be too shell-shocked all the time. Then, you can get back to the firefighting after a bit. It’s very important.

Paige Cruz: (15:57)

Totally. And I’m a big one for reflection and introspection. So, as an organization, are you looking back to ask: “Hey, we onboarded five new vendors. Are they still working out? Should we continue the relationship, or are some of them falling short of the support that we were promised and expected? We’re not seeing the customer enablement. Where are the workshops that we were promised?” I would also take a pulse. Right now, the economic climate is a little frosty. People are feeling the stress. What can we take off people’s plates? What is adding to that cognitive load? Hopefully it’s not your observability tool, but studies show that it most likely is a contributing factor. And so, really talk to the people holding the pagers. They are the experts in their experience, and they are the experts in knowing how your system is doing and what mitigations, interventions, and maintenance need to be done. And I think it’s time to listen to the operators.

Ryan Donovan: (16:59)

To go back to the original metaphor: do you think that operators were treated as cattle for a while, or still are?

Paige Cruz: (17:06)

Yeah. And maybe it’s not as extreme as: “I’ll fire you if the Jenkins box blows up on Thursday afternoon,” but there’s really this idea that ops are just there. They’re always going to be there. They’re the ones you call. For me, in SRE, I would be called in to work on or troubleshoot services I hadn’t touched, in languages and platforms I was unfamiliar with – hello, Elixir and functional programming. And so, over time, I built up that institutional knowledge.

But every operator is different. We don’t really have an ops school. We don’t really have “getting your PhD in tech ops,” right? And so, every operator that we have today has lovingly, and sometimes painfully, built up this base of experience and this bench of real-world skills. And we are pets. I mean, you really should think of and treasure your operators, because they have one of the hardest jobs. Again, [this is] bias coming from a former operator’s perspective.

Paige Cruz: (18:09)

But we have some of the hardest jobs because of the context switching, and because of the level of responsibility we’re trusted with to keep the foundation of the platform running… all of that jazz. So it does feel like, with Kubernetes, we sort of lobbed Kubernetes over to the devs and said: “Hey, it’s great for us. It really automates a lot of the scaling concerns we had.” And now we have a lot of developers who are asking: “What’s a pod? What is a CrashLoopBackOff?”

And it feels like we’re still negotiating that line of what it means to build it, own it, run it, maintain it, operate it, observe it, and how dev and ops work together. How do you feel? Do you feel like we’ve made some strides in how operators are viewed and treated?

Ryan Donovan: (19:02)

I think we end up talking about this a lot. So, it’s still an open question.

Paige Cruz: (19:09)

Work in progress.

Ryan Donovan: (19:10)

Yeah. We just published an article talking about SRE team topologies – like, how do you set up an SRE team, right? Who builds it, who runs it? And how do you succeed with these various team configurations?

Ben Popper: (19:25)

Paige, to your point, this is something that has come up quite a lot – not forced by Ryan and me because it’s our favorite topic, though we do enjoy it, but because it’s really pertinent to the way modern software organizations are run. SRE topology is one example. Another piece we ran recently had to do with observability debt, and the fact that, in today’s environment, as you said, with microservices and Kubernetes and containers and orchestration, the act of observability often becomes far more difficult than it would have been with a bunch of local machines and a few network connections. So, it’s definitely taking an increasingly central role. And I guess the thing that makes it stand out, and that’s probably rubbing folks the wrong way, is that people don’t always get credit when things stay up and don’t fail. You get the flak when things fall flat, and then every once in a while there’s a big incident that’s responded to quickly or efficiently, and the SREs get to shine. But a lot of the time, the good work that keeps things running goes unseen, right? As opposed to shipping a shiny new product. So, maybe that’s some of the disconnect between the hard work that happens and the burnout, or the recognition of that hard work, in a way.

Paige Cruz: (20:37)

Oh, totally. It’s very unglamorous, and I think it’s a little bit easier for a lot of folks to make the connection to: “I’m a software engineer. I built a feature. You can listen to Spotify with your friend across the country.” That’s something that, whether or not you’re a software engineer, you can kind of get. But explaining the beauty of how you automated maintenance for your Kubernetes cluster? No one is taking the bait on that. I will say, on that note of recognition, I’ve been delighted, as this is the year of cost cutting and cost efficiency, to hear from many orgs where folks are saying: “Hey, when SRE is able to cut the bill or meet these targets that the company set, all of the sudden we’re getting applauded.” Because the work that SRE has done is directly tied to a business goal. So, this may be the year. Soak up all the praise you get; who knows if it’ll come back. But at least on your cost-cutting projects, take all the shine you can.

Ben Popper: (21:37)

Okay. Check your Zodiac calendars. This is the year of SRE. You heard it here first.