Principal Developer Advocate
Paige Cruz is a Senior Developer Advocate at Chronosphere, passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to Site Reliability Engineering, holding the pager for InVision, Lightstep, and Weedmaps. Off the clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.
Founder | Certo Modo, SRE consultancy
I’m a consultant who specializes in dramatic turnarounds for tech companies that struggle with operations and scale. After living in Boston for 12 years, I became a digital nomad to explore the US, learn new things, meet interesting people, and go to the occasional electronic music show. I enjoy sharing my knowledge and experience with others through writing, public speaking/webinars, and of course as a guest on podcasts!
Paige talks with Amin Astaneh about SRE, the evolution of system visibility spanning their careers, and how to actually quantify toil.
So you have sources of toil that are hitting your team, hitting your engineers, sapping all the time and productivity. What do you do in the beginning? As with anything in DevOps and SRE, we need the data to back up our decision making. So we need to track our time. I’m not saying that we track our time writing code, but everything that is toilsome, we’re tracking our time on it.
Paige Cruz: Hi there, listeners. Before we dive into today’s episode, a quick note. Due to some technical hiccups with our previous recording platform, this is actually the third take of this amazing conversation with my guest today, Amin. So you might hear us referencing earlier attempts or past conversations, and now you know why.
Thanks for sticking with us, and let’s jump in.
Today’s topic is near and dear to my heart, Site Reliability Engineering, or SRE, and I am so delighted to be joined by Amin Astaneh, who runs Certo Modo, an SRE and DevOps consultancy. Hello, hello! How’s it going today?
Amin Astaneh: It’s good to be here Paige. Yeah, thanks for having me on. Try number three.
Defining SRE as “Ops with Boundaries”
Paige Cruz: I want to start us off with the basics. What exactly is SRE? Because before, you mentioned this phrase that has been rolling around in my head ever since, that “SRE is Ops With Boundaries”. What does that mean to you?
Amin Astaneh: So let’s go all the way back to the beginning of the literature. SRE is what happens when you ask a software engineer to design an operations function.
So you’re taking all of the tasks normally done by hand by system administrators or ops people, and you’re automating it away with software, or at least managing that manual labor with software.
When we’re talking about boundaries, I was in classic operations for many years before I started doing DevOps and SRE.
And when I worked-
Paige Cruz: Classic Ops.
Amin Astaneh: Classic Ops. Yes. So in that model, teams like mine were subject to endless manual work and tickets and outages because there wasn’t a tight feedback loop between the work that we were doing and the engineering team.
Paige Cruz: OK. The silos, the wall, that wall that DevOps crumbled – you lived that.
Amin Astaneh: The wall of confusion, we had that wall of confusion. There was a period where the engineering team wasn’t even on-call!
Paige Cruz: Blows my mind
Amin Astaneh: I’m telling you, it was really rough. It eventually got fixed but in that ancient time, it was really, really painful because there wasn’t anything stopping them from releasing changes that continued to make the system less reliable.
Across all the various ways that SRE can be implemented (I usually do the embedding model), we’re placing constraints on how much incident response work we’ll take on. The common tool that we’re using for that is actually the SLO. By establishing Service Level Objectives, if we’ve spent all of our error budget, well, we’re going to slow down or stop the rate of changes to the system, meaning that the rate of incidents should decrease during that period.
That is a way to establish a boundary around the type of work that we’re taking on. Because toil and incidents, that’s unplanned work that’s waste to the business and it’s also hard on us as engineers.
Paige Cruz: Quite stressful. There’s never a great day for an incident. As much as I value learning from them, the experience, the growth, I can’t say, “Oh, you know what? I would love an incident tomorrow morning at 6am.” Not going to happen. It wouldn’t be better if it was 10. I mean, marginally…
I like framing it as unplanned work, that is such a business-y term.
Amin Astaneh: Sure. I got that from The Phoenix Project back in the day, right? That was the term that they used for it and it stuck with me.
Beyond SLOs we’re also trying to automate away aggressively all of that manual effort, as I mentioned earlier to put a cap on that type of work.
Then finally, some engineers in some organizations have the ability to vote with their feet by moving between teams. When I was at Meta you could decide to move from one team to another as a Production Engineer. Meaning if you were on a team that didn’t quite help enforce those healthy boundaries in terms of work-life balance and too much time spent on-call, you could leave the team and management had to deal with the consequence of that.
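To make the error-budget boundary concrete, here is a minimal sketch of the release-gate logic Amin describes, assuming a 99.9% availability SLO; the thresholds and function names are illustrative, not any specific vendor's tooling. Once the budget for the window is spent, changes slow down or stop.

```python
# Illustrative error-budget gate: the target, thresholds, and names are
# assumptions for this sketch, not any particular platform's API.

SLO_TARGET = 0.999  # 99.9% availability objective over the measurement window

def error_budget_remaining(successful_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means it's blown."""
    if total_requests == 0:
        return 1.0
    allowed_error_rate = 1 - SLO_TARGET                        # 0.1% of requests may fail
    observed_error_rate = 1 - successful_requests / total_requests
    return 1 - observed_error_rate / allowed_error_rate

def release_gate(successful_requests: int, total_requests: int) -> str:
    remaining = error_budget_remaining(successful_requests, total_requests)
    if remaining <= 0:
        return "FREEZE: error budget spent, stop or slow feature releases"
    if remaining < 0.25:
        return "CAUTION: less than 25% of the error budget left"
    return "OK: ship normally"

if __name__ == "__main__":
    # 9,985,000 good requests out of 10,000,000 -> 0.15% error rate, budget blown
    print(release_gate(9_985_000, 10_000_000))
```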
Paige Cruz: Wow. What I’m hearing is in the dark ages pre-SRE, even if you wanted to say no to something, or even if you wanted to say,
“Hey development team, let’s have a little more care in our release process”
or
“Here’s why I don’t want cowboy coders running off and pushing to prod right away”.
Even if you had those feelings, even if you expressed and communicated that, there wasn’t a culture or expectation that would support you enforcing those boundaries. And today what’s different, why SRE is thriving in so many different organizations, is because SREs have more of that, I don’t want to say power or authority…
Amin Astaneh: Dev agency.
Paige Cruz: Agency!
And some of the things that helped along the way, DevOps bringing developers on-call, having the “You Build It, You Run It” model, there were sort of these parallel shifts in the industry to say:
“Hey, developers, we’re going to share more of this production burden. Ops is going to hand some of that delegation over to you, and let’s figure out how to work on this together.”
Amin Astaneh: Yeah, correct. It is a sharing, especially for teams that have an SRE embedded on them. The point isn’t just to leave the engineers in the corner to deal with the pain of on-call. It is about learning how to continually improve that experience. SRE should be in the rotation with them, sharing in that discomfort and pain, but also learning how to make it better through continuous learning.
Paige Cruz: Yes! Oh, I love that!
WhatsUp Gold
Paige Cruz: Let’s bring this back to your journey, now that we’ve covered the basics of SRE. Think back to your first experiences on-call: what was the monitoring tool you turned to? What dashboards were you looking at?
Amin Astaneh: Oh my goodness. I don’t know if I turned to it, I think it was more like, forced on me because that was the only tool around at the time. So there was this ancient tool that was made in 1996 called WhatsUp and I was running a version called WhatsUp Gold.
Paige Cruz: Ooh, fancy.
Amin Astaneh: I know. And the way that it worked is it just sent pings, ICMP, to hosts. And then if the ping packets didn’t come back it would send SMS messages. I was carrying an actual beeper, a pager.
Paige Cruz: Yes!
Amin Astaneh: Right?
Paige Cruz: Yes! I’ve been waiting, I’ve been waiting for someone to have the real physical pager. Okay.
Amin Astaneh: Right.
Paige Cruz: Dang. How long of a message could you get? Because now PagerDuty calls you and is like “Blah, blah, blah, here’s the 300 word alert monitor name that you have”. I can’t imagine you were working with that many characters.
Amin Astaneh: No, it was like tweet sized or less, because you have that itty bitty little screen.
Paige Cruz: It’s so small!
Amin Astaneh: With the LCD, so there’s only so many characters you can fit in there. So it would tell you the host name and the status. It was really rudimentary.
Paige Cruz: Okay. No emojis? You are reading all of this just raw text.
Amin Astaneh: Yeah, there was no UTF-8 at that time. We were still rocking ASCII, you know?
Paige Cruz: Oh my goodness. Okay, so WhatsUp Gold was basically a pinger, maybe a lighter-weight version of what folks use today for synthetic checks?
Amin Astaneh: Yeah, but just ICMP. You’re not even doing HTTP requests against a health endpoint or something a little more sophisticated. We’re really just-
Paige Cruz: Oh, it’s literally, “Are you up? Yo, are you up?”
Amin Astaneh: “Is the host up?”
Yeah. So that would not be useful for distributed systems, right?
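For a sense of how simple that era was, a WhatsUp-style check boils down to something like this sketch: ping the host, and if no reply comes back, send a page. The page function is a stub standing in for the SMS-to-beeper step, and the ping flags assume a Linux ping command.

```python
# Rough sketch of an ICMP up/down check in the WhatsUp Gold spirit.
# Uses the system ping via subprocess; the alerting step is a stand-in.

import subprocess

def host_is_up(host: str) -> bool:
    """Send a single ICMP echo request (via system ping) and report success."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],  # one packet, ~2s timeout (Linux flags)
        capture_output=True,
    )
    return result.returncode == 0

def page(message: str) -> None:
    # Stand-in for "send an SMS to the beeper" -- tweet-sized or less.
    print(f"PAGE: {message[:140]}")

if __name__ == "__main__":
    for host in ["db01.example.com", "web01.example.com"]:
        if not host_is_up(host):
            page(f"{host} DOWN")
```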
Paige Cruz: No. OK. And then, I think in one of our earlier chats, you’ve also contended with friend of the podcast, or frenemy, Nagios.
Was that a big step up from WhatsUp Gold? Was that, at the time, mind blown, state of the art?
Amin Astaneh: Oh, absolutely. So we had the concept of hosts and services. You had the NRPE plugin, which ran on the hosts and could run a whole set of various checks. I wrote checks in Bash and Perl and Ruby or whatever. It was really good for the time.
Then of course, it began to support things like time series using PNP4Nagios and round robin database files, and it would actually print out the little itty-bitty graphs. Of course it had limitations, but it was a paradigm shift for the moment.
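For anyone who never wrote one, a Nagios/NRPE-style check is just a small program that prints a status line and exits with a conventional code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Amin’s were in Bash, Perl, and Ruby; this sketch uses Python for consistency with the other examples here, and the disk-usage thresholds are made up.

```python
# Minimal Nagios-plugin-style disk check: one status line, conventional exit code.

import shutil
import sys

WARN_PCT = 80
CRIT_PCT = 90

def check_disk(path: str = "/") -> int:
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total
    # Performance data after the pipe is what graphing add-ons can plot over time.
    perfdata = f"used={used_pct:.1f}%;{WARN_PCT};{CRIT_PCT}"
    if used_pct >= CRIT_PCT:
        print(f"DISK CRITICAL - {used_pct:.1f}% used on {path} | {perfdata}")
        return 2
    if used_pct >= WARN_PCT:
        print(f"DISK WARNING - {used_pct:.1f}% used on {path} | {perfdata}")
        return 1
    print(f"DISK OK - {used_pct:.1f}% used on {path} | {perfdata}")
    return 0

if __name__ == "__main__":
    sys.exit(check_disk(sys.argv[1] if len(sys.argv) > 1 else "/"))
```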
Has System Visibility Increased Over Time?
Paige Cruz: Sure. And now thinking from the Nagios days through APM, and we’ll talk a little bit about your time at Meta (I know Meta’s got some pretty sweet internal tools), to the massive explosion of what are now called observability systems.
You’ve really gotten to see everything from the humble ping that paged you on a physical beeper to pulling up a distributed trace across a massive system today.
In your opinion, would you say that visibility has increased? That the tooling for investigating issues in production has gotten better for developers and operators? Or has it just kept pace: as the architectures change, we change how we monitor, and we’re roughly at the same level?
Amin Astaneh: I think it’s really gotten better. With the modern tools, we’re able to quickly zero in on things in our infrastructure and in our systems that are failing. There’s an increased level of sophistication in how we consume our metrics. But there’s another side of that coin. If you’re starting an observability journey without discipline you can monitor all the things and then you end up with increased costs. You don’t really have a strategy in terms of how to emit and consume that data. So with that power, we also need a greater level of responsibility and judiciousness about what metrics we do emit, store, retain, query, and so on.
Paige Cruz: You’ve got to have a plan for management. And I’m enamored with “You Build It, You Run It” but I think we forgot to say you’re also running, managing, and responsible for the telemetry.
Anti-Pattern: SRE As The Keepers of Observability
Paige Cruz: What I struggled with a lot as an SRE is, yes, I’m here to help make the system better, I’m here to help increase your visibility, help you use our observability platforms. What I am not going to be great at is making decisions about your team’s services’ telemetry. I don’t know what logs are most important to you or what could be totally tossed, and I shouldn’t be making those decisions, app developers should.
But there seems to be this line, and I don’t know if you run into this with clients or from what you’ve seen, where SRE is seen almost as the keepers of observability and owners more so than we should be.
I’m guessing yes, based on that reaction.
Amin Astaneh: 100%. And it’s really difficult because when you are the only SRE on a team, or maybe two or three of you, and you have a team of 10, there starts to be this separation between engineering work and SRE work. That’s a massive breakdown, right? An anti-pattern.
The way that it should work is that we SREs are introducing engineers to higher levels of operational maturity, which means we are showing them like, “Hey, this is how we thoughtfully use observability platforms.”
Do we review? Yes.
Do we provide feedback? Yes.
But we’re not the keepers of operational responsibility.
We are the guides.
Paige Cruz: There are a lot of metaphors; I think one of them is that SRE is a park ranger.
You are responsible whether you go on or off the paved trail. I let you know where to put your garbage, but it is still on you to follow the rules. I could lay out all the guideposts I want, but you’re still making choices.
Amin Astaneh: That’s right. Pack it up, pack it in.
Paige Cruz: Yes. Yes.
SRE Education: School of Hard Knocks
Paige Cruz: You started in classical ops. What does that mean for how you learned about monitoring and observability? Did you pick up most of your knowledge on the job? Was there monitoring 101 at college? How did you gather all this knowledge yourself?
Amin Astaneh: Oh, it was school of hard knocks from necessity, right? And in the early days, yeah, that’s how it went.
Because a lot of this tech was very new and it was based on an open source project. There wasn’t really a vendor, so you had to make it work, meaning we had to be experts.
And over the years, I’ve touched so many: Nagios, obviously. There was Ganglia, because that was useful when running grid compute back in the day. There’s StatsD and Graphite, which are still in use today, Grafana, Prometheus, even SignalFX, that vendor.
And whatever Meta runs internally. I’ve had to become a local expert on all of those things. Only recently have we had this golden age of SaaS companies building observability products and then writing amazing documentation and tooling and enablement and workshops, right?
Paige Cruz: Yes. Having worked for a handful of vendors, I say you don’t need to start from scratch. Between your vendor, and especially your TAMs and whoever is working to support your account, there should be so many resources at your disposal if you want help accelerating the rollout of tracing, introducing a new feature, or even optimizing your costs.
Those vendors should have folks helping you out. That is what you’re paying them a lot of money for. It’s not just to store your metrics. It’s for improving your business through observability data. And that goes for any vendor, not just the one I work for.
Learning Meta’s In House Observability Stack
Amin Astaneh: Oh yeah. And it’s a big job to maintain and build and operate your observability stack.
Some teams decide to do it, as in the case of Meta, because it makes sense. Some teams don’t because they want to focus on their value add, which is the product they’re building.
Paige Cruz: Yes, and thinking through your time on-call, Meta is a great example. Meta’s got an internal tool they’ve built.
How was the documentation? How do you go about learning a tool that you’re never going to find anywhere else?
Amin Astaneh: Oh my goodness. They have a really interesting benefit that you have direct access to the engineering team that builds and operates it.
Paige Cruz: You’re in the same Slack or Facebook chat.
Amin Astaneh: Yeah, whatever their chat system is. Workplace, right?
So, the way that it worked (I had a similar thing for my team) is that we had a group just for our team that was like a support line, and people could post questions. There was a process on how to ask for help, and it incentivized them to update the internal wiki with “this is how you do queries” and “here’s all the various transformations you can do against your data”.
Organically, over time, because of course people don’t want to be interrupted with support tickets, they are constantly updating their documentation so that it deflects a lot of those frequently asked questions, right?
So there’s already an internal incentive to keep documentation and tooling up to date.
Paige Cruz: I like that feedback loop.
15 years On-Call!
Paige Cruz: These days you work as a consultant. How many years were you on-call, slash, are you still on-call as a consultant? And how the heck would that work?
Amin Astaneh: Right. Yeah, good question. My first on-call shift was around 2007, and then I was at Meta through 2022, so we’re talking 15 years.
Paige Cruz: Woo!
Amin Astaneh: Right?
Paige Cruz: Clap, clap. I mean, that is an achievement.
Amin Astaneh: It’s a long time. A long time in the trenches. So I don’t do on-call presently as a consultant. But I did recently launch a service where you can hire me and my team of experienced SREs to be basically like an escalation path. If you’re stuck during an incident, you can call us in and we can jump in and assist you.
So I do offer it as a service, but at the moment,
Paige Cruz: Bat signal
Amin Astaneh: Yeah, bat signal very, yeah, very much like that.
Amin’s Final Page From Meta…
Paige Cruz: As we talk about off-call life, I personally do not think it’s ever fun to get paged.
Especially when I’m thinking about the weird hours pages, or in the case where, “oh my god, my team owns too much, I’m afraid I’m gonna get paged for a part of the system that the sub team works on that I have no freaking idea what’s going on with, yet I’m the first line of defense here.”
Amin Astaneh: Sure.
Paige Cruz: But, none of that was the case for your most memorable page. If you want to share the final page you got from Meta, what about it was stressful? And share the story with listeners.
Amin Astaneh: Absolutely. It was almost exactly two years ago.
Paige Cruz: Oh my gosh.
Amin Astaneh: If you remember, two years ago was the first wave of layoffs in our industry. Meta was one of the first big companies that did it, and I was affected by that; I was laid off that day. And you get locked out of your infrastructure and you can’t do your normal work stuff because you’re no longer working for the company at that point.
Paige Cruz: Yeah, the laptop gets bricked right away.
Amin Astaneh: Yeah, like that. But what happened was I did get a page on my phone from infrastructure I no longer owned.
Paige Cruz: A nightmare
Amin Astaneh: So I got paged after being laid off. So I had to reach out to my boss, using regular Facebook messenger and see-
Paige Cruz: Facebook for normies.
Amin Astaneh: Right.
“Hey, boss, man, I’m getting paged for this. I just want to let you know,”
Paige Cruz: “Bring this to your attention.”
Amin Astaneh: Yeah. So that was pretty wild being paged, from a monitoring system I no longer supported or worked for.
Paige Cruz: That you’re responsible for.
Amin Astaneh: Right.
Paige Cruz: From that experience, hindsight is 20/20 now, you’re working and consulting, you work for a lot of clients.
What is the takeaway for that? Like, when it comes to personnel changes, anecdotally, I’m like, yeah, you probably want to have checked the on-call schedules before you did this and made some changes.
What is the risk had you not gone to tell your boss? My mind goes to the worst case: that incident could have really spiraled, could have been worse or gone on longer had you not come forward.
So what are the takeaways from that experience?
Amin Astaneh: Sure. There’s a couple of things.
Number one, you’re going to need to actually measure the time it takes for you to de-provision access for your former employees.
And then number two you do need some kind of automation or something to ensure that there is someone on-call for every service that’s in your service catalog or whatever. For smaller organizations, that might be manual. It might be on a checklist. But for larger organizations, that needs to be enforced via code.
So in that instance, I suppose that their automation that does the de-provisioning of a human being from their system might not have hit the on-call stuff yet.
There could have been a delay. So like they locked me out of all the essential stuff, but then it probably took time to cascade throughout all of their system.
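A hedged sketch of the coverage automation Amin suggests might look like the following: walk the service catalog and flag anything with nobody currently on-call. The catalog format and the current_oncall lookup are hypothetical stand-ins for whatever your paging tool’s API actually exposes.

```python
# Hypothetical on-call coverage check: every service in the catalog should
# have someone on-call right now; report the gaps.

from datetime import datetime, timezone

SERVICE_CATALOG = [
    {"service": "checkout-api", "schedule_id": "SCHED-1"},
    {"service": "billing-worker", "schedule_id": "SCHED-2"},
]

def current_oncall(schedule_id: str) -> str | None:
    """Stub: replace with a real query to your paging provider's schedule API."""
    fake_schedules = {"SCHED-1": "amin@example.com", "SCHED-2": None}
    return fake_schedules.get(schedule_id)

def find_coverage_gaps() -> list[str]:
    now = datetime.now(timezone.utc).isoformat()
    gaps = []
    for entry in SERVICE_CATALOG:
        if current_oncall(entry["schedule_id"]) is None:
            gaps.append(f"{now}: no one on-call for {entry['service']}")
    return gaps

if __name__ == "__main__":
    for gap in find_coverage_gaps():
        print("COVERAGE GAP:", gap)
```

In a small shop this could stay a manual checklist, as Amin says; the value of the automated version is running it on a schedule so de-provisioning and team changes can’t silently leave a rotation empty.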
Paige Cruz: Which is wild. And I’m sorry that happened to you. On an already stressful day, props to you for bringing it back to your boss. You must have had a great relationship, because I know a lot of people would have just been like, “Ahaha, good luck! Good luck with that.”
But you don’t become an SRE because you hate computers. You don’t become an SRE because you want things to stay the same. There’s a big draw in the improvement, and the whole point is, let’s keep these systems up and reliable for as long as possible.
Because it keeps continuing to happen- it’s not that I want companies to do it [layoffs] better, but I do think the point of your company is this product or service. Why would you jeopardize that? Do you not know how you’re jeopardizing it? Maybe it speaks to how siloed SRE or operations can still be for orgs that haven’t truly embraced their technological presence. I’m not sure. I can’t speak for those companies.
Amin Astaneh: Yeah, organizations can be so complex, right? So a decision to lay someone off has cascade effects that they don’t anticipate; unknown-unknowns come up.
You could lay off a person that has tribal knowledge about a really critical piece of infrastructure in their head. And they’re the only person that knows about it because they didn’t document it. So once they’re gone, that, that knowledge, that brain trust goes with them.
So the next time you encounter a problem with that component, it is now even worse, because you have to now excavate out of the ground the knowledge that that one person once had.
Paige Cruz: And rebuild from scratch. Yeah.
Amin Astaneh: Yes. Yeah. Unavoidable.
Paige Cruz: This is some great advice for organizations. I don’t know if they’re listening. I do know that we’ve got engineers listening.
On-Call Advice
Paige Cruz: What on-call advice do you have for folks who are joining a new rotation? It can be scary, especially if you’ve just joined a new job, maybe you were laid off or needed to get out of a bad situation, and on-call can have a heightening effect on stress. It’s a lot of pressure.
Amin Astaneh: Sure. 100%. There’s a lot to say. The first thing I will mention is that the first time you get paged, especially if you’re inexperienced with on-call, you’re going to have that fight or flight response. You’re going to have- seriously, your amygdala is going to fire.
Paige Cruz: It’s terrifying. It is.
Amin Astaneh: It is what it is. That is going to happen. And it’s important that you have some compassion for yourself when you are responding in that way. It’s completely natural.
But there’s a few things that come to mind about this. Number one, don’t be afraid to ask for help as the on-call, because your task is to be the first responder, and to assess and triage the actual impact and scope of the incident. And answer the question, “Do I have the tools on hand to solve it?”
You’re not going to be expected to understand the entire production system, especially for gigantic distributed systems like I was at Meta. That would be impossible.
Paige Cruz: There’s no way. Nope.
Amin Astaneh: Right? Yeah. So it’s cool to ask for help. That’s what second-level escalation paths are for. For large companies, that’s why there are multiple on-call rotations. If you need help, reach out to the team that is best equipped to handle it, or reach out to maybe a senior member on the team, or whatever you need to do. There’s nothing wrong with that.
Paige Cruz: Yeah. It does not need to be a solo expedition, and it shouldn’t be. Especially for how some of these effects can spider out to other teams, responsibilities, and areas.
Amin Astaneh: Exactly. And if there’s something that you just don’t know, then you can talk about it during the postmortem and identify the gap in process or technology. So it’s OK if you need to ask for help.
A couple more things. This seems self-explanatory, but you’d be surprised how often this becomes a problem. When you are going on shift, you’re assuming the responsibility of being on-call, especially for the first time. Make sure you can access production. Do you have access to the tools? Right.
Paige Cruz: Do a little dry run.
Amin Astaneh: Do a dry run. Try some tools, get on the system, look at the dashboards. Make sure you have access to everything, all the tools you need to respond, since it’s-
Paige Cruz: Click the links in the runbook.
Amin Astaneh: Yeah. Yeah. Do all of it. And also make sure that alerting works.
Paige Cruz: That’s pretty foundational, yes!
Amin Astaneh: Make a fake alert and make sure it routes through the system, it hits you and blows up your phone. I remember when I was at Meta I wrote an On-Call For Noobs document for our team.
Paige Cruz: Oh nice!
Amin Astaneh: One of the things that was in that step-by-step guide was actually configuring your phone to allow the phone number that sends you alerts to bypass your Do Not Disturb.
Paige Cruz: Yessss.
Amin Astaneh: So you could have Do Not Disturb set up so that you don’t get chat messages or things that will make noises, but the pages will still go through.
So your phone will still notify you if there’s a problem. So that’s important.
Paige Cruz: That’s very important.
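One way to run the “make a fake alert” drill before a shift is to trigger a manual test event through your paging tool. The sketch below assumes PagerDuty’s Events API v2 (PagerDuty comes up earlier in the episode); other tools have equivalent test mechanisms, and the routing key is a placeholder.

```python
# Send a deliberate test page so you can confirm routing, escalation, and
# that your phone actually makes noise before your shift starts.
# Requires the `requests` package and your service's Events API routing key.

import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def send_test_page(routing_key: str) -> None:
    payload = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": "TEST PAGE - verifying alert routing before on-call shift",
            "source": "oncall-dry-run",
            "severity": "warning",
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
    response.raise_for_status()
    print("Test event accepted:", response.json())

if __name__ == "__main__":
    send_test_page("YOUR_ROUTING_KEY_HERE")  # placeholder, not a real key
```

Remember to resolve or acknowledge the test incident afterward so it doesn’t keep escalating.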
Holiday On-Call
Paige Cruz: I can remember a time during the Thanksgiving holiday when of course I was at a company that was really nice, and it wasn’t, “Oh, you drew the short end of the stick, you’re on-call all week for Thanksgiving.”
No. For every holiday, we took each day, and then I think we even broke it up into half days. And then we had people opt in. If dinner was important to somebody but lunch wasn’t, they’d take the lunch shift, that sort of thing.
But! It meant people were not on their typical routines and would have stuff like Do Not Disturb on.
We actually had a page escalate past primary to secondary, past secondary (I don’t think we had a tertiary), straight up to the CTO.
Amin Astaneh: Whoops!
Paige Cruz: And it was just that kind of thing where, oh my god, I didn’t see it. It wasn’t malicious or intentional. It was just out of the ordinary routine. And Do Not Disturb was on, this stuff wasn’t checked.
Just a little anecdote. It sounds so simple, right? Make sure you can load your observability tool and that pages go through. But the reality is, we’re telling you this from real experience: they are absolutely things you should be doing.
Amin Astaneh: Yeah, absolutely.
Since we’re on the subject, for holidays, make sure that on-call shifts during the holidays are fair. You don’t load Christmas and New Year’s and Thanksgiving on one person. Split it up, look at previous years and see who did what and make sure it’s fair for everyone. Because running production can be a pretty intense job and we need to spread the load properly.
The final thing I wanted to mention about advice for newbies that are on-call is, if you see something, say something.
If you see a tool’s broken or a runbook isn’t actually effective or it’s out of date or an alert was annoying, flapping, not actionable, that type of stuff.
File a ticket, right? Just track the gap.
For teams that do retrospectives on on-call shifts (which I highly recommend and have written about: do retrospectives per shift),
you bring it up and you can get the work prioritized and done. Any source of pain, even a little bit of pain for you, can translate into a lot of pain for people down the road once your system gets bigger.
So it’s important to track these things as you see them.
Paige Cruz: And having that list helps. I don’t know about you, but I am a very strong advocate, depending on the size of your team and your rotation, that your on-call time is spent with production. It is not spent on project work. It is not about project work. It is your time to make the system better.
Amin Astaneh: Amen
Paige Cruz: Because if you’re the one person on the team that does that and nobody else does, it is exhausting and demoralizing, and you as one person on-call every fourth or fifth week cannot tackle the big list you’re building up of all of these things you’re logging. But if every on-call primary is spending their week making systems better, making sure the runbooks are good, it turns into that continuous improvement.
Everybody’s contributing. That’s just one of my many SRE soapboxes is on-call is for making production better.
Amin Astaneh: Oh yeah, I’ll join you on that one. For the product managers and the engineering team leads in the room: don’t count the on-call capacity as part of sprint planning or however you plan your work.
That time should be 100% dedicated to the reliability of the system. So 100% agree with that sentiment.
Paige Cruz: That’s why having that log is great, because some folks love to be the curious explorers gathering all that information and bringing it back. Some people just want to execute.
I do think, at the beginning of your on-call shift, open up that observability tool or tools, depending on how many you have, and just get comfortable.
I would just love to go exploring: what has this service been up to lately? How’s the load been over the last week? What is the longest API response time for the services I’m on-call for?
Like really using that on-call time to explore. If you don’t have a laundry list of things to get through, that can be your time to learn about the system and learning time is very hard to come by in some orgs. So protect your on-call time.
Amin Astaneh: Yeah. Wisdom.
Toil: Why It’s Bad and How To Quantify It
Paige Cruz: So in the vein of kind of productivity, how we’re spending our time.
There’s a real big push these days for developer efficiency and productivity, doing more with less, and going above and beyond while we’re cutting budgets.
The things that really trap people, whether it’s technical debt or toil, those sorts of murky tasks stack up over time.
You can feel it, you can feel the work is sludgy, there are too many processes in the way. “I always have to run this script manually, but I don’t have time to automate it.” For people caught in those loops, I know that you’ve had success tracking and quantifying toil specifically.
So give us the playbook, and maybe start with what is the cost of not addressing toil?
Amin Astaneh: Let’s talk about toil and what it can do to your team. So what we’re talking about is engineering efficiency.
Imagine an engineer is spending 40 hours, they have 40 hours a week to do their job. Now ideally, they’re spending 35 plus hours on writing features. As the quantity of toil increases, then they’re spending 20 hours on writing features, 15 hours on writing features, and then all of that manual work is waste. That is overhead, that’s operational cost. Which from a business perspective is really bad, right?
Paige Cruz: Not great.
Amin Astaneh: Not great at all. That’s why it’s important. You want to have an efficient engineering team so you’re getting the return on investment because engineers tend to be the most expensive members of your staff.
Paige Cruz: Aside from enterprise sales reps. If you don’t have a friend in sales, go make one. Man, some of those commission checks. They work for it, I will tell you, they do work for it.
Amin Astaneh: Oh, sure. Yeah, there was a sales guy a couple of jobs ago whose nickname was Diamond Boots for a reason. He was making the dollars getting those deals.
Paige Cruz: Aside from that – engineering headcount, super spendy. Our time is very precious to a company.
Amin Astaneh: Exactly. So that’s the problem that we’re trying to solve. OK.
So you have sources of toil that are hitting your team, hitting your engineers, sapping all the time and productivity. What do you do in the beginning? As with anything in DevOps and SRE, we need the data to back up our decision making. So we need to track our time. I’m not saying that we track our time writing code, but everything that is toilsome, we’re tracking our time on it.
For example, let’s say you have a CI/CD system and you’re continuously testing the main branch and every once in a while the tests break and the tests are flaky and you have to go and jump in and investigate them and fix the tests. So that’s a source of toil.
Paige Cruz: Yes!
Amin Astaneh: So if you track the amount of time collectively as a team on that type of work, you are then able to extrapolate the number of FTEs this specific problem is taking up and the amount of dollars this specific problem is consuming.
Paige Cruz: Oh, the business loves when you talk money.
Amin Astaneh: Time and money. Absolutely.
So I actually wrote an article called In Defense of Time Tracking. And again, we’re not talking about tracking every single thing you do because that’s a pain in the butt and that doesn’t feel good.
I think people are incentivized to say,
“This thing’s painful. I’m going to go and hit a stopwatch and put it in the ticket: hey, I spent a couple of hours on this,” or use Toggl. I use Toggl for time tracking. That could be a very useful tool for this type of thing, and then you’re able to quantify it.
So turning it into a dollar amount, that’s going to be really useful for product managers, because then you can say, “Hey, this flaky test thing costs half an FTE.”
and they’re like, “Yep, I’m putting that in the next sprint.” It’s a no-brainer; you don’t have to sell it, they understand the impact of it.
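The arithmetic behind “this flaky test thing costs half an FTE” is simple enough to sketch; the hours logged and the fully loaded cost per engineer below are made-up illustrative numbers.

```python
# Turn tracked toil hours into FTE-fractions and an annualized dollar figure.

FTE_HOURS_PER_WEEK = 40
FULLY_LOADED_COST_PER_FTE_YEAR = 250_000  # hypothetical fully loaded engineer cost

def toil_cost(team_hours_per_week: float) -> tuple[float, float]:
    """Return (FTE fraction consumed, annualized dollar cost) for one toil source."""
    fte_fraction = team_hours_per_week / FTE_HOURS_PER_WEEK
    annual_cost = fte_fraction * FULLY_LOADED_COST_PER_FTE_YEAR
    return fte_fraction, annual_cost

if __name__ == "__main__":
    # e.g. babysitting flaky tests adds up to 20 team-hours a week
    ftes, dollars = toil_cost(20)
    print(f"Flaky tests consume {ftes:.2f} FTE, roughly ${dollars:,.0f}/year")
```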
Paige Cruz: Yes.
Amin Astaneh: This whole principle, for me, came from the book called The Goal by Eliyahu Goldratt, which was like The Phoenix Project before The Phoenix Project.
Paige Cruz: Okay.
Amin Astaneh: It was like the 1980s version. It was like a story and everything.
Paige Cruz: Yes! Okay, I’m gonna look that up.
What I think is so powerful about this is, when I was in SRE, I always said I practiced it almost as a developer evangelist or advocate. That was my flavor of SRE. And a lot of that is building one to one relationships with folks and their teams. And in conversations, you hear a ton of stuff about friction in the workflow, the flaky tests, or, “OK, you thought it was a big deal that the Go version upgrade broke stuff, but actually we’ve got a great handle on it that was a one off.”
It was in conversations that I said, “Oh things I thought were a big deal maybe wouldn’t have been to the developers” and vice versa.
What I like is that this quantification takes what we feel and know and what’s frustrating, and instead of venting to our bestie in Slack, which of course we can still do, if you want change and you don’t want to be in this position, quantifying and telling the story with data is the way to effect that change.
Because I will tell you, complaining about pipelines in your team’s channel is not the way to get stuff done. It may feel good in the moment, but it’s not going to get you to a better operational cadence.
Amin Astaneh: Yeah, absolutely. This is proven in my career.
My career development was an effect of doing this. I’ll give you an example. So back in that ops team that I mentioned, I was a manager on that team. I created some tooling to help the ops team track all the time for all the work that they were doing.
’Cause they were more IT-oriented, and we were classifying the work across the four dimensions, the types of work in The Phoenix Project.
Paige Cruz: Can you remind me?
Amin Astaneh: Yeah. Business projects. Internal projects. Operational Change, releases and things like that. Then Unplanned Work.
So we’re going back to unplanned work again.
Paige Cruz: Ah, here we go.
Amin Astaneh: I wrote some tools that actually talked to Toggl and Jira, and everyone was able to track their time a little more easily than doing it manually. Then it generated a Grafana dashboard with the collective percentage of those four types of work across the entire team.
Paige Cruz: That is a view people would pay very good money for.
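A much-simplified sketch of that reporting: bucket tracked time into the four Phoenix Project work types and compute team-wide percentages. Amin’s actual tooling pulled entries from Toggl and Jira and fed a Grafana dashboard; here the entries are hard-coded and the output is plain text.

```python
# Classify tracked time into the four Phoenix Project work types and report
# each type's share of the team's total hours.

from collections import defaultdict

WORK_TYPES = ["business_project", "internal_project", "operational_change", "unplanned_work"]

# (work_type, hours) tuples as they might come back from a time-tracking export
tracked_entries = [
    ("business_project", 12.0),
    ("unplanned_work", 9.5),
    ("operational_change", 6.0),
    ("internal_project", 4.0),
    ("unplanned_work", 8.5),
]

def work_mix(entries) -> dict[str, float]:
    totals = defaultdict(float)
    for work_type, hours in entries:
        totals[work_type] += hours
    grand_total = sum(totals.values())
    return {wt: 100 * totals[wt] / grand_total for wt in WORK_TYPES}

if __name__ == "__main__":
    for work_type, pct in work_mix(tracked_entries).items():
        print(f"{work_type:20s} {pct:5.1f}%")
```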
Amin Astaneh: Yeah. So I then scheduled meetings with the C-level, including the Chief Product Officer who was recently hired, Christopher Stone, and I’m like
“Here’s the dashboard. This is the amount of money that we are literally setting on fire keeping the status quo.”
And the response he gave was he got budget for a Site Reliability Engineering team, got budget for an internal tooling team, and put me at the head of it. That is what happened.
Paige Cruz: And that’s how the seeds of SRE are watered. Oh my gosh. This quantification business. Gosh, it’s like data should be the foundation of our decision making at work. Oh, what a wild concept.
Amin Astaneh: Who would have thunk? Yeah.
Should Engineering Managers Be On-Call?
Paige Cruz: So I’m learning you have not only been classical ops, you have been SRE, you have also been SRE manager. So I really want to know your perspective on should managers be on-call? How high up the leadership chain should folks be carrying the pager? And should managers be just an escalation path, or should they participate in primary rotations? What are the trade offs here?
Amin Astaneh: Yeah, good question. So it depends.
Paige Cruz: Yeah, favorite answer.
Amin Astaneh: It depends.
So in small teams, let’s say you’re a manager that’s starting a team, you’re hiring people, you have to model the behavior that the rest of the team needs to do when they’re on-call. So you being in the rotation gives you the opportunity to show them how it is done.
Paige Cruz: Set the example. These are the values. These are the expectations.
Amin Astaneh: Especially when you’re building the team. Then when you do have an established team, maybe I don’t know, four to six engineers, I would recommend that you still hang out in the rotation.
Paige Cruz: Okay, surprising.
Amin Astaneh: And I’ll tell you why. You want to remain in touch with the operational realities of running the service. If it’s hurting you when you are on-call, then you have a clue that something is off. Especially if the team isn’t complaining about it, but you’re feeling it, then you know there is some level setting that we have to make in terms of really paying attention to the on-call state.
At that point, that’s like the outer bounds of when you should remain on-call.
So if you have a larger team, you should take a step back, allow the team to own on-call. Cause hopefully you have some senior engineers that can provide direction and drive it themselves cause they have the context. They’re doing it every day.
Paige Cruz: Follow in the footsteps
Amin Astaneh: ’Cause you need to build your succession plan: who’s going to be next in line when you leave the company? You have to think about these things.
When you are a senior manager, maybe director, there is a path for you. Because, there is responding to incidents, but there’s also this thing called incident management. Which is for the big ol’ incidents, you have to start thinking about: Do customers need to be notified? Do the C-level need to be engaged? Are there multiple teams that need to get involved? What if a team gets stuck and they don’t know what to do? What if we need to escalate and pull in a bunch of people? How do you coordinate that effort?
So at Meta, they have the IMOC Incident Manager On-Call and they have a rotation of senior managers and directors and all they do is the incident management. It’s usually for incident levels, like for us, SEV levels higher than a two, which was like a major component is failing and it has a business impact.
Paige Cruz: The big incidents that require a lot of coordination and collaboration between groups who may not collaborate on the daily. Interesting.
Amin Astaneh: And the IMOC thing I think is optional. Because at that scale you don’t necessarily need to be in it, but it would probably be a good idea for you to. But once you’re like senior manager, director, you should probably get out of the way, especially with the tactical nitty gritty debugging stuff.
Paige Cruz: Yeah. Wow. Okay. This is so interesting because when I burned out from SRE, a path that I thought I would take with my career is, “Oh, maybe I will manage and create the SRE org or team that past me would have flourished in.”
And I ultimately walked away thinking, no, it is still too close to the experiences and the responsibilities that caused me to burn out and to be really stressed. And I don’t think I could go back in the fire.
So what I’m hearing with your advice is yes great engineering managers in the certain context where it makes sense are on-call. They are so close to the pulse of what you said, the realities of operating the system as it is today. I think that builds the true empathy, not sympathy, where I feel bad for you, but the empathy, I feel bad for you because I’m in the same boat. We are wearing the same shoes.
Amin Astaneh: I feel bad for us.
Paige Cruz: I feel bad for us. Yes!
SRE is For Everybody
Paige Cruz: This has been such a lovely chat. Are there any final words of wisdom, advice, or anything you want to leave folks with when it comes to SRE? What should folks, if they could change one thing that they think about SRE, what would it be?
Amin Astaneh: SRE has a very interesting connotation in this industry. It came from Google, of course, they wrote the book on it. And therefore the literature and the mentality around Site Reliability Engineering is that it’s only a practice that is appropriate for large organizations. I feel and believe from my own personal practice as a consultant, that SRE practices can be super effective on teeny tiny startups.
Paige Cruz: SRE is for everybody.
Amin Astaneh: SRE is for everybody. I’m developing this process called Lean SRE that I use in my consultancy, based on my experience at Meta and Acquia, where we insert one engineer on a team and bootstrap the reliability process. I call that Lean SRE and have been using that.
Paige Cruz: Oh, that sounds like it will have very applicable takeaways for folks listening. I will underscore all of that. You don’t have to be a FAANG hyperscaler to benefit from SRE and better reliability practices. It’s not a tool. It’s not one person. It’s a set of cultures, practices, and tooling, just like DevOps is.
All right. We have given listeners so much food for thought. I hope that you have learned something new about SRE or maybe changed a misconception that you held about it.
You will be able to find links to Amin’s blog and the services at Certo Modo in the show notes. Thank you so much for being on today, Amin.
Amin Astaneh: Thank you so much.