Podcast: Slight Reliability | Cognitive Overload

Cognitive overload. We discuss it, define it, share experiences, and give examples. What are the benefits we might see if we reduce overload and provide sufficient slack to our engineers? We start with situations that lead to overload, cover some things that help alleviate it, and pinpoint observability as one area that can help!

Featuring:

Paige Cruz

Senior Developer Advocate
Chronosphere

Paige Cruz is a Senior Developer Advocate at Chronosphere passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to Site Reliability Engineering holding the pager for InVision, Lightstep, and Weedmaps. Off-the-clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.

Stephen Townshend

Developer Advocate
SquaredUp

Transcript:

Stephen: Welcome back to Slight Reliability. I’m Stephen Townshend and this is the show where we learn about SRE and observability one week at a time. Today we are talking about cognitive overload, observability, fluency, and a whole bunch more. And joining me today is Paige Cruz, who is a senior developer advocate at Chronosphere.

[00:00:24] She’s passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to site reliability engineering, holding the pager for InVision, Lightstep and Weedmaps. Off the clock you can find her spinning yarn, swooning over alpacas or watching trash TV on Bravo.

[00:00:49] Hello, Paige. How are you?

[00:00:51] Paige: Great. I’m doing really well today and yes, I will talk about alpacas anytime, but that is a different podcast.

[00:00:59] Stephen: The one thing I know about alpacas is I heard they spit at you. Is that true?

[00:01:02] Paige: Yes, the camelid family, they do tend to spit. We are too polite as SREs to do that, but when they get frustrated, that’s their way of blowing off steam.

[00:01:14] Stephen: So Paige, why is observability important to you?

[00:01:19] Paige: Oh my gosh. For so many reasons. Very practically, it has been the foundation of my career. I was reflecting lately on why am I on my third observability company? Why can’t I just quit this niche? What is it that keeps bringing me back? And I really think it has to do with the power of democratizing data. When I think of observability, I think of really rich contextual system signals that you can ask whatever question you want of. I’m a very inquisitive person so for me, there’s something beautiful about troubleshooting in a system with high observability, the rapid fire: question, answer, question, answer. Zooming at a system level, down to a container in a pod and back out again. And let’s look at historical. There’s something so magical about that experience, and it is not the experience I think a lot of people have today with their monitoring and observability tooling.

[00:02:26] Since I’ve had a taste of working in these highly observable systems, I’m very keen on getting everybody to the aha moment. I want to bring everybody the joy I feel when I look at a summary page of metrics. I want people to see that as a treasure box of goodies, not as this boring part of their job or something that they have to go do because the bill got too high and now they need to actually look at this stuff.

[00:02:53] I think of your observability telemetry as a toy chest, I’m excited to play with the data, ask the [00:03:00] questions and explore.

[00:03:01] It means a lot to me because one, it’s been the foundation of my career. I just have spent a lot of time thinking about how to make this easier for folks, and because it’s helped me on-call, it helps my husband on-call. It helps all of my friends who are on-call. The more observable our systems are, the better they are for the humans in the loop who have to operate this stuff.

[00:03:25] It is not a joke that we are on call and carry a pager. It may be a cell phone, but that’s a very real responsibility that we should acknowledge.

[00:03:34] Stephen: One of the challenges, particularly for roles like SRE and pretty much any senior engineering role in the industry, is cognitive load.

[00:03:45] The stuff that you need to keep in your mind to do your job effectively is getting more and more difficult; there’s more and more stuff to hold in there.

[00:03:54] I’ve talked about it on the show before, and you and I had a chat a couple of weeks ago around this concept of cognitive overload. Is it real? Is cognitive overload a real concern? From your perspective, before we even go further…

[00:04:10] Paige: Absolutely. It can be as simple as thinking of how many clicks does it take for me to get to where I want to go or to get an insight or an answer. If you’ve ever seen the Tootsie Pop commercials of the owl, it’s a cartoon owl, and he’s like, how many licks to the center of the Tootsie Pop for me?

[00:04:29] I’m like, how many clicks until I can get to the fricking signal and the answer that I’m looking for? Because if I have to go from tool A to tool B and back to tool A again, now that I have a bigger thread to pull on, that stuff takes up a lot of brain space and it also upsets me as an observability practitioner.

[00:04:50] Stephen: I think it was in Team Topologies. One of the books I read recently, it talked about not just individual cognitive load, but also team cognitive load. I think that there’s an [00:05:00] aspect of communication, like there’s only so much that a single team can collectively work on together before they get overloaded.

[00:05:07] I guess there’s communication elements in there and how that information is shared and to create the hive mind of knowledge

[00:05:13] Paige: Absolutely. When I started going to DevOpsDays when I started my career about six, seven years ago I still was hearing about the stories of when Dev threw it over the wall to Ops. Oh my God. And I don’t think we ever moved on past that.

[00:05:30] The meme that I see now is Ops got fed up and then just shoved Kubernetes at the Developers and said, “It’s a platform! Now you deal with this complexity.” Do you think that we’re still having that tug of war of what the right line is between developers and SRE?

[00:05:48] Stephen: Absolutely. I totally agree. And then you push it back and it gets pushed back: “Hey, here’s a platform, here’s Kubernetes or any platform as a developer, now you can do it for yourself.” And then developers are, “But no, I don’t wanna deal with [00:06:00] that.” It’s the hard stuff and there’s something potentially intrinsically satisfying and more complex about operational platform type work potentially as opposed to development. Maybe there’s a massive bias or judgment call on my part.

[00:06:17] I dunno, I get the feeling that no one really wants to do Ops, but it needs to be done. And Dev is the sort of artistic one, and you get to be at the front, be the hero on the podium.

[00:06:31] Paige: Yeah. It is! I was warned before I really went heads down into the SRE path that it is unglamorous, it is not celebrated. It is if the lights go off, yeah, we care but if the power’s on, nobody’s applauding you for that. Earlier I was like, I don’t need roses, I don’t need accolades. But it wears on you year after year of having to hold all of this state.

[00:06:56] You look around and you’re like gosh, my team is tasked with knowing the whole path from PR to production. Not only every single environment that’s there, the multiple clusters we have in production, but I also now need to know the CI/CD pipelines, and I have to help folks troubleshoot that.

[00:07:13] And oh, by the way, we never trained our developers on Kubernetes. So they’re asking what a Pod is and you’re like, “Whoa. You hired me and you asked me all this Linux stuff and now I’m finding out I’m a teacher, an advocate, an evangelist.” I think that’s where the cognitive overload really comes into place for the team.

[00:07:33] For the team boundary, what are you really expecting your SREs to know? Was that different than what you interviewed them for? Has it evolved as your platform or product that you’re selling has evolved and the architecture’s evolved?

[00:07:48] I tend to call that kitchen sink SRE, where you’re that junk drawer where you can find a spatula, a whisk, thermometer. We’re kind of the Swiss army knives of a [00:08:00] department, and that is an exhausting role to play.

[00:08:03] Stephen: That’s a lot of pressure to put on that team.

[00:08:06] Paige: Yeah.

[00:08:06] Stephen: In your career, have you had experiences of cognitive overload or something close?

[00:08:13] Paige: Yeah. For me, it comes down to not knowing the why. I can think of one system that I was trying to onboard myself to, and there were three proxies on the path from edge to container. And I just could not understand. I think one was a reverse proxy. And, of course there were reasons, but what I kept getting told is, “We don’t need them anymore. Don’t worry about that. You don’t need to think about that.” And I’m like, but I do because when something goes wrong and I don’t know what happens from A to B in the handoffs from C to D, that makes my life really complicated. And why aren’t we simplifying this?

[00:08:58] If we don’t need this, please, let’s [00:09:00] delete it. What ended up happening was I did have to investigate an incident that was due to me being unfamiliar with one of the three proxies. I knew about the other two and had accounted for them in my testing. And then that third one was a surprise to me.

[00:09:15] That’s what actually ended up causing an issue. I’m pretty good about not putting blame on myself and not letting others play the blame game. But I think it spoke to, if I couldn’t understand this, and this passed PR checks on my team and my sister team on the infra side, how could we expect an engineer to understand this stuff? Or to self-serve? The cognitive overload of just understanding how our requests traveled through our system, when that is a complex story, woof.

[00:09:48] It’s time to either chop out some of those proxies or for me, again, get to the why. If I know why something’s there, even if it’s an unsatisfactory answer, it [00:10:00] sticks in my mental model. But just to have people say, oh, that’s not important. Over and over my brain started writing that stuff off.

[00:10:08] How about you? Does that resonate?

[00:10:10] Stephen: I have a feeling we’ve had a slightly different sort of career trajectory, in that most of my work has been for complicated enterprise and government, rather.

[00:10:19] I could see in that sort of enterprise world that’s a common situation, like a really big complex solution and a program to work on it and just no one can hold it all in their brain, or no team can even hold it all in their brain.

[00:10:30] Paige: Yeah. To me it’s making those expectations explicit, because if no one’s telling me the bounds of what my team owns, it feels like SRE has to just fill in the gaps.

[00:10:43] Implicitly it feels like people are saying none of the devs know the full end-to-end picture, so _you_ better know that end-to-end picture.

[00:10:50] I think a lot of folks find themselves in that position, surprisingly with microservices. I remember when we were all breaking up the monolith going into [00:11:00] microservices. I gave this talk, “Microservices means Macrocommunication.” I totally made up that word; macrocommunication means nothing, it just sounded nice. The whole gist of it was the monolith was nice. You had engineers that understood one stack, one framework. Was it messy to share responsibilities? Sure. Now that we’ve got all this constellation of microservices, I have so many different teams that I have to talk to to get an API endpoint updated or deprecated, and the human-to-human or team-to-team communication then becomes super important, which is ironic because everyone moved to microservices for: pick your own stack, go at the speed you want, be this little isolated unit, do one thing well. I’m like, we didn’t fulfill that vision, in my opinion.

[00:11:51] Stephen: Interesting that more communication doesn’t mean better communication as well. If you don’t create formal communication channels between the teams or the people, the stakeholders that need to talk to each [00:12:00] other, then informal ones get created and you won’t even know. I think maybe it’s a leadership thing: as a leader, you won’t necessarily see all this chatter going on in the background, which is actually draining everyone and...

[00:12:11] Paige: Oh, absolutely. One thing I’ll say is that instant messaging is probably what burned me out the most as an SRE. Once instant messaging came into the game and people thought, okay, I can do help desk through here, or oh, the SREs are just an instant message away, the quality of questions really started to go down, and that adds to cognitive overload.

[00:12:35] If I’m answering the same 101 level questions over and over, and the organization isn’t going to support me by training folks or saying, “Hey, we expect that you know XYZ walking in the door,” that’s another form of cognitive overload for me, of just repetitive answering questions.

[00:12:52] Stephen: Yeah.

[00:12:52] Paige: I always wanna help developers. I would always drop what I’m doing to help a human out. But again, it gets fatiguing and [00:13:00] it gets overwhelming after a period of time.

[00:13:03] Stephen: Bit of a side note there. We had the same thought in my previous role, and I think they’re still working on it now: an internal Stack Overflow type system where once you post the question there, the answer’s there forever.

[00:13:16] Have you ever seen that before?

[00:13:17] Paige: We at Chronosphere use Stack Overflow for Teams. So Stack Overflow was like, brilliant, we’ve already got all the devs coming to us. Let’s turn this around. I’ve also used Confluence knowledge base. I don’t care what tech it is.

[00:13:30] Whatever durable knowledge store you have that is such a rich source of onboarding. The quick answers that you think an instant messaging tool will get you actually, you’re much better served by a big knowledge base that’s been built on year over year.

[00:13:50] Stephen: So I haven’t been on-call at all. You have. Tell me about alert fatigue.

[00:13:59] Paige: Oh my [00:14:00] gosh. Personally being on-call a lot because I worked at small to mid-stage startups, so the SRE team was tiny but mighty. And again, it was kitchen sink SRE. We did IT, we did security. We dabbled in a lot. I would say being on a small on-call rotation was one of the primary factors that I personally burned out.

[00:14:26] I would talk to people on my team about it and they’d say, “The pager doesn’t really go off.” And I’m like, yeah, but _it always could_. And that’s what my brain would hold onto as well. It always could go off. You don’t know what’s gonna happen. And because of my kind of knowledge area with observability, I know how much of the system’s not instrumented.

[00:14:50] I would have some sort of confidence level in the quality of alerts and how much of the system we had visibility into. And oftentimes it was [00:15:00] not super high quality alerts, bad signal to noise, and knowing that there were some gaps in the monitoring or that because I was an SRE and would need to troubleshoot app level stuff, I would be going into other teams’ dashboards that have their own idea about how things should be set up.

[00:15:19] They maybe don’t have the nicest chart names; the chart names would all be the metric names. And I’m like, that doesn’t tell me what is happening to the customers right now. And that’s what I care about because I’m in mitigation mode, not in actually-fix-it-for-real mode.

[00:15:34] So that drain of constantly being on call for a system. It wasn’t even trust but verify with the alerts. It was verify, then trust it. I was spending cycles to see should this alert have fired? When was the last time someone touched this? What’s the history of this alert? Can I see if the last time anyone acked this alert that they actually did [00:16:00] something and what was it that they did? I’d do a whole mini investigation before even getting to the actual signal that triggered the threshold to see if there was a real problem.

[00:16:10] And that alert fatigue over time, you just become desensitized to the noise and that’s a very scary place to be. It’s very scary to have a new engineer join and say, “Oh my God, how are you looking at all these alerts?” and reply, “I physically cannot. I am one human. I know that there’s stuff in there that is probably a simmering fire, but it is so lost in low quality signals. I’m so tired. I’ve been doing this a long time. You’ll get to this jaded spot.” It’s not even that it’s not fun; it doesn’t feel responsible.

[00:16:42] Alert fatigue is something that I really caution people to look out for in their organization. It could be a particular team that is snowed under with alerts. It could be a particular part of the platform. Maybe your company acquired somebody and they’re having [00:17:00] trouble integrating into your holistic monitoring strategy. Whatever it is, whether it’s the whole system or parts of the system, I think tackling alert fatigue is probably the best thing you could do for addressing the cognitive overload upfront.

[00:17:16] I can’t say that I recommend on-call; do your homework before you join an org.

[00:17:21] Stephen: I feel like there’s a whole other episode just on alert fatigue and how to set up your alerts in a way that is effective.
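
The "verify, then trust" check Paige describes above (has anyone acked this alert, and did they actually do something?) can be roughed out in code. This is only a minimal sketch under stated assumptions: the `AlertFiring` record, its fields, and the sample history are hypothetical and not taken from any particular alerting tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertFiring:
    """One historical firing of an alert (hypothetical record shape)."""
    alert_name: str
    acknowledged: bool   # did a human ack the page?
    action_taken: bool   # did the responder actually do something?


def actionable_ratio(history: List[AlertFiring], alert_name: str) -> float:
    """Fraction of past firings of `alert_name` that were acked AND acted on.

    A low ratio hints that the alert is mostly noise and worth reviewing.
    Returns 0.0 when the alert has no recorded firings.
    """
    firings = [f for f in history if f.alert_name == alert_name]
    if not firings:
        return 0.0
    acted = sum(1 for f in firings if f.acknowledged and f.action_taken)
    return acted / len(firings)


# Made-up sample history, for illustration only.
history = [
    AlertFiring("high-cpu", acknowledged=True, action_taken=False),
    AlertFiring("high-cpu", acknowledged=True, action_taken=False),
    AlertFiring("high-cpu", acknowledged=False, action_taken=False),
    AlertFiring("high-cpu", acknowledged=True, action_taken=True),
    AlertFiring("checkout-errors", acknowledged=True, action_taken=True),
]

for name in ("high-cpu", "checkout-errors"):
    print(name, actionable_ratio(history, name))
```

With this sample data, only one in four `high-cpu` firings led to any action, which is the kind of number that could back up a conversation about pruning or retuning an alert.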

[00:17:28] Paige: There’s a great book Effective Monitoring and Alerting for Web Operations. It’s an old O’Reilly book, but it focuses mostly on the concepts, not on a particular tech stack. That is the one I recommend folks go to.

[00:17:42] Stephen: In preparation for this discussion, we were thinking about examples of situations which might lead to cognitive overload in the wider sense.

[00:17:51] I had a few ideas. One of them is having unfocused teams. So if you are in a team and the purpose of your team isn’t clear [00:18:00]. Team Topologies by Matthew Skelton and Manuel Pais illustrates this beautifully.

[00:18:04] I think that teams should either be: a stream aligned team, like a product team, right? Or they should be like an enablement team, which is enabling these other teams. Or they’re like a platform team who runs the stuff that helps other developers do their work. Or maybe they’re a complex subsystem team who runs this really complex weird thing on the side.

[00:18:21] Anything beyond that is ineffective, and if you’re trying to do two or three of those things at once, it gets overwhelming cuz you just, you don’t have enough focus. I think that’s one thing.

[00:18:30] Paige: Yeah, I totally agree with that.

[00:18:33] Stephen: Another one is having an over-reliance on senior principal staff engineers who have all the knowledge and capability. And it’s trying to get that knowledge and capability out and spread it around the organization. But if you don’t, then of course those individuals become a bottleneck.

[00:18:50] Paige: Yes.

[00:18:50] Stephen: Very difficult.

[00:18:52] Paige: Yes. I’m so curious about what, as an industry, what do we think [00:19:00] we’re doing for the next generation of operators? We should be raising them up and training them. A lot of the Ops folks I know have been like, “I learned the hard way. And oh, back then I took out a whole rack and never did _that_ again!”

[00:19:15] I’m like, gosh, we’ve got to move past this. Everybody’s in it for you’re on your own learning journey and it’s probably gonna be awful. By the time you are 10 years into the business, you’ll have seen everything. You’ll have helmed these incidents and then you’ll be this kind of magical senior engineer who can respond instantly and know what’s going on.

[00:19:37] I think we’ve definitely gotta do better than that. You’re right, that it is about that knowledge sharing. How do we translate those years of seasoned experience into bite-sized lessons that are relatable and useful for people in their day-to-day? I don’t know!

[00:19:55] When did you feel that you had gotten your ops chops? [00:20:00]

[00:20:01] Stephen: I don’t think I ever got them really. I did 13 years in performance engineering, so I was working in the delivery space. I was dabbling in sort of monitoring of production and I wanted to get involved in that space. Because of the experience of I’m doing all this stuff upfront, but actually does it even mimic the real world?

[00:20:19] What’s actually happening in the real world? That’s where the value is. I wanna work in that space. I don’t know if I have my chops yet cause I haven’t been on-call for a complex system and I think that’s the rite of passage.

[00:20:29] Paige: Yeah. I even had a moment where I was like, I don’t know if I’m an operations engineer. My friend was like, “Yeah, I’ve watched what you’ve been doing, you most certainly are.” Maybe that speaks to the never ending learning journey that we’re on in tech. When is it enough knowledge? When are you seasoned or senior or principal? I don’t have good answers. I just know I’ve seen some things and I’ve learned some lessons.

[00:20:56] Stephen: Yeah. Another situation which lends itself to cognitive overload [00:21:00] is having an organization that doesn’t have priority or budget for operational work.

[00:21:06] Paige: Mm-hmm.

[00:21:07] Stephen: I think you’ll inevitably end up with this massively growing tech debt, fighting fires constantly, and then everyone’s under the pump. And how can you move on from that point because it’s just gonna get worse and worse.

[00:21:19] Paige: Absolutely. The toil that some people put up with! I eventually started to tell everyone when I was an SRE, if you are doing a manual task or operation on a somewhat regular cadence, please come talk to me. Like, I’m happy to automate that for you. I do not want engineers clicking. Again, how many clicks does it take?

[00:21:41] I don’t want engineers clicking five different things to just get something out the door or check on a deploy. The budget and priority for improvement or simplification work: we used to call it gardening work. It’s what the engineers like to do. We feel good taking care of and maintaining our systems. But [00:22:00] it’s hard to argue that we should upgrade the database because we always wanna stay two behind the latest when the other chunk of work is delivering a new feature. That’s always a really difficult debate to get around.

[00:22:14] Like you said, it has to be a part of the culture to do this gardening work. It’s not something you can continue to push off year after year. It’s definitely something that should be a part of your on-call. I personally don’t think on-call should take sprint points ever. That really is your time to dedicate to fixing the system, but even that is a really hard battle to win.

[00:22:38] Stephen: I think if anything, at the moment there’s a lot of tech companies, big tech companies, laying people off, and it’s supporting this idea that in a lot of companies leadership, very high leadership, really just cares about making money for shareholders and a lot of other stuff doesn’t seem to matter as much. I think it’s crazy. I think it’s bad.

[00:22:59] Paige: It’s [00:23:00] wild. It’s such a short-term view to take on something that I think should be given the medium to long-term view. Whether or not the system that you have or the product or platform you’re selling you love today, there’s certainly nuggets and pockets of brilliance in there that could be a business pivot.

[00:23:20] I don’t think it’s fair to let these beautiful systems just wither and rot away. We’ve gotta put in some basic maintenance.

[00:23:29] Stephen: I think something else which adds mental complexity is the architectures of the systems that we work in; some architectures are more conducive to mental health than others. So I think tightly coupled architectures, which also means tightly coupled teams,

[00:23:47] are really difficult. More complexity, more needing to orchestrate things. Not doing the work upfront, like I said before, to take a complex problem and break it up effectively into smaller problems [00:24:00] which a single team can own and manage and operate themselves. That’s a lot.

[00:24:04] It’s hard work, but if you do that work up front, I think it makes a big difference.

[00:24:08] Paige: Yes. I have not yet worked at a place that has done domain driven design. I’m not sure if you’ve heard of that book or philosophy, but the gist as I understand it is to break down your architecture into logical business domains and get everyone in a certain domain using the same language, whether they’re the product manager, a support engineer, an SRE, a software engineer.

[00:24:36] Having people use the same language to talk about the same things is actually a surprisingly difficult thing to do in tech when we’ve overloaded every word under the sun. I would love to work at a company that has done that.

[00:24:52] In monitoring, I know my way around. I know our customers, know the basic use cases, but for something like e-commerce or [00:25:00] gosh, even streaming platforms, it would do you a lot of good to sit down and figure out the domains at the start.

[00:25:12] Stephen: I know, and the interesting thing is that as engineers, I don’t think we have the skills alone to do that. It needs to be a full organization thing.

[00:25:22] Paige: Yeah.

[00:25:23] Stephen: Yeah. So what’s the antidote?

[00:25:26] Paige: How do we solve this for everyone? Good question.

[00:25:29] Stephen: Paige, where do you go?

[00:25:30] Paige: It’s why they pay me the big bucks. I think there needs to be a recognition that there is this problem. I think if you go right into solution mode, you’re missing the human part, that you have humans that are cognitively overloaded, they’re maxed out on RAM.

[00:25:50] Working in these systems is taxing and draining. So you have to acknowledge and give some space for the feelings and the “My God, [00:26:00] finally we’re gonna work on this.” You’re gonna have some salty people. That’s just how it is in Ops. We put up with a lot, and so I wouldn’t be surprised if people are a little crunchy. Once you have that debrief, then start to look at focusing your teams. What is a sane boundary for your SRE team or central observability team? They can’t do everything. They absolutely can’t. The best thing you can do is say, this is what’s not in scope, or this is an unowned shared piece of our infrastructure that needs investment.

[00:26:33] I would love if SRE could give up responsibility. I don’t know about you. It would’ve been nice to say, “Hey, our plate’s not big enough for that.”

[00:26:43] Stephen: Yeah, like we said, decoupled architectures where large, complex problems are broken up. If you can get to the point where a team of six to nine people can independently, completely own a service end to end almost completely, there’s always gonna be exceptions. [00:27:00] That’s a good place to be, I think.

[00:27:03] Paige: Yes. And even in that system where you have the independent teams… If I was CTO for a day I would do the team API model. Every team is gonna have the namespace, IM channels, it’ll be #team-whatever. It will change as we reorg.

[00:27:23] I want to have weekly office hours. I want to know that I can talk to someone on that team, a real human, about a legitimate issue. Go through: who’s on this team? Where do they live? Why is their Slack channel named the lobsters? Like what? I just want to be able to answer who owns what and how can I get a hold of a human for a conversation?

[00:27:44] If you can figure that out for your org and standardize that, oh my God, you’re lowering cognitive overhead already. I don’t have to go on a whole quest to find out my point of contact.

[00:27:57] Stephen: My old boss used to talk about the team API model, I think, and I [00:28:00] didn’t really get it, but I think I understand what you’re talking about now. Sounds very cool.
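
One way to picture the "who owns what, and how do I reach a human" lookup from the team API idea above is a small registry. This is a minimal sketch under stated assumptions, not a description of any tool mentioned in the episode: the `TeamAPI` fields, the team names, the service names, and the office-hours values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class TeamAPI:
    """Standardized, human-facing contact card for one team (hypothetical shape)."""
    name: str
    chat_channel: str        # e.g. "#team-whatever"
    office_hours: str        # when a real human is available to talk
    owns: Tuple[str, ...]    # services this team owns end to end


# Hypothetical registry contents, for illustration only.
REGISTRY = (
    TeamAPI(
        name="lobsters",
        chat_channel="#team-lobsters",
        office_hours="Tuesdays 14:00-15:00",
        owns=("checkout-service", "payments-proxy"),
    ),
    TeamAPI(
        name="observability",
        chat_channel="#team-observability",
        office_hours="Thursdays 10:00-11:00",
        owns=("metrics-pipeline",),
    ),
)


def who_owns(service: str, registry: Sequence[TeamAPI] = REGISTRY) -> Optional[TeamAPI]:
    """Return the team that owns `service`, or None if it is unowned."""
    for team in registry:
        if service in team.owns:
            return team
    return None


owner = who_owns("checkout-service")
if owner is not None:
    print(f"Ask in {owner.chat_channel}; office hours: {owner.office_hours}")
```

A `None` result is useful too: it surfaces the unowned shared pieces of infrastructure that need a home.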

[00:28:04] Paige: Google Calendar has it built in that you can set up office hours. I’m just saying!

[00:28:12] Stephen: Another thing: I’ve had Sebastian Vietz on the podcast before. He’s talked about how he thinks there are different pillars of SRE.

[00:28:23] I interpreted that in my own way, which is there will be a handful of super geniuses who can do literally every part of SRE.

[00:28:29] You and I are mortal and can’t do that.

[00:28:30] I’m not saying you won’t work on all aspects of it, but I think that people can be owners of a certain area of focus for a period of time, and I think that helps potentially with cognitive overload.

[00:28:41] What do you think?

[00:28:41] Paige: I would love if I were setting up an SRE team to do initiatives or programs. I love the idea of, “Hey, we are gonna tighten up our incident management, and this time we’re focusing on how customer success can get involved and do status page template updates [00:29:00] so that anybody can put out updates that are approved by corporate.” Wonderful.

[00:29:06] Clear beginning, clear end. You’ve got something you can measure going forward. And if you’re interrupted in the course of doing that work, say a huge fire that rises up with the observability bill that you’ve gotta go put out, and you’re the only team that can do it. You really have something to show the business.

[00:29:24] Hey, for my time we had option A, improving our incident response, and option B, putting out this fire. Look at where I spent my time. Is that really at the end of the day how we wanted to use this very expensive team? Probably not, but you’ve gotta make that work. You’ve gotta make those trade-offs. So SRE should just be communications engineer.

[00:29:47] Honestly, it feels like we’re very much the conduit for a lot of interesting system conversations.

[00:29:55] Stephen: That is a trend. I don’t know what it’s like across the industry, but the organizations I’ve worked at [00:30:00] recently, you would have a very small number of high level engineers. Maybe two or three people who would know enough context to see the full picture. Very little of that across the organization. And it’s challenging.

[00:30:15] Paige: It is challenging. Those kind of superhero SREs that you’ve talked about, I’ve totally worked with some in my day. When I think about the factors, how could I be that person when I grow up? Part of it is some of them had 20, 25 years under their belt. Some of them had built the internet.

[00:30:32] Like of course over time you’re going to habituate the nuances of a system. Where you see a spiky graph and you’re like, oh, I know what’s going on there, because we can grok this stuff better over time. The other part of it is what other systems have you been exposed to? Have you been exposed to systems on bare metal where you had to provision everything yourself and care about the security, or are you, like me, a baby born in the cloud [00:31:00] who doesn’t think about, I literally don’t think about the power supply to the building because hey, that’s Amazon’s problem. That’s Google’s problem.

[00:31:08] That’s the other part: how the senior architect SRE shares their knowledge. Find those differences in your path and another SRE’s path and that can help you build those bridges of understanding. Of why do we do this? Or why is this important? Or how did you know to telnet or whatever. That was a horrible example. How did you know to put in that query to get the answer? I always ask those questions that get those experts making their internal thinking explicit.

[00:31:43] Stephen: You were talking before about that sort of priority call based on, we are working on this initiative, we had to fight a fire. Is that a good use of our time? What do you think? I know that there are organizations that say you will spend 20% of your time or one day a week or whatever, doing improvement work. [00:32:00]

[00:32:00] I think it’s, you say that and then it doesn’t make any difference cuz people will just go and fight a fire if there’s a fire.

[00:32:06] There’s a whole bunch more that needs to go with it around the culture, and saying you really need to honor this time because it’s really important, and it has to come from the top.

[00:32:14] Paige: It does have to come from the top. That’s the part about SRE where, if I were to say let’s just not do this anymore, it is, quote, “influence without authority.” I’m sorry, that doesn’t make sense to me. I’m happy to see these observability aha moments, but just because I teach an engineer how to emit metrics or how to add labels, that doesn’t mean they’re gonna go do this for every PR if their boss isn’t telling them, making them aware, championing better observability.

[00:32:47] Ugh. I feel like we’re describing SRE as this like Sisyphean task of just rolling the boulder up the hill and then watching it fall and being like, all right, another day, another [00:33:00] boulder.

[00:33:02] Stephen: I think it can be, it depends on the context, right?

[00:33:04] Paige: Yeah. Totally.

[00:33:06] Stephen: The last thing I thought about actually came from my previous boss, Steven Gill, who’s been on the show before. He was a big fan of “stop starting new things and finish the things you started before,” even if a new thing comes along, which seems super important.

[00:33:19] Just finish what you did, so you’ve got it. It’s good psychologically to say, we finished this thing, and you can move on and you can open up the mental space for a new thing and probably do a better job at it.

[00:33:30] Paige: Oh, totally agree. The Jira hack that one of my managers did was to put a limit on how many tickets we could have open and in progress. That column would light up red as soon as someone dragged in the... I think it was one or two tickets per person. So you could ping-pong between two things.

[00:33:47] So once we got the eighth ticket, that column lit up red. Whoa. Why is so much going on? Let’s talk about this. Why did you think it was important to bring this in? I cosign this one. Agree. [00:34:00]

[00:34:00] Stephen: Any other thoughts on the antidote or the cure for cognitive overload?

[00:34:06] Paige: With respect to the senior architect superhero, the person that you always go to because you know that they will get you the answers you need quickly: while they have all of the knowledge, they’re not always able to teach. Teaching and mentoring is a totally separate skill from being an amazing firefighter.

[00:34:27] Those folks may need a little bit of guidance on how to get somebody from zero to “I understand logs,” or zero to “okay, I can troubleshoot this bug on my own because I’ve learned PromQL,” or whatever it is. But you can’t expect that those people are naturally going to be these wonderful teachers. That’s not really what’s gotten them so far in their careers.

[00:34:53] Treasure those people, appreciate them, and start to take the load off of their shoulders and help them. [00:35:00]

[00:35:00] Stephen: Thank you so much, Paige, for coming on the show today. We are gonna continue this discussion next week because there is so much more to talk about.

[00:35:11] Anything else you wanted to talk about before we sign off today? Any words of wisdom?

[00:35:19] Paige: Let’s start National Appreciate Your SRE Day. January 24th. We’re calling it!

[00:35:25] Stephen: It is the 25th here. I missed it. Damn.

[00:35:27] Paige: Oh, dang it.

[00:35:30] Stephen: That’s all for another episode of Slight Reliability.
