Podcast: How observability provides insights for businesses & why data control is vital

Chronosphere’s Developer Advocate Paige Cruz sits down on this episode of the Enginears podcast with host Elliot Kipling to talk about typical engineer challenges, how Chronosphere has helped companies like DoorDash scale, Paige’s career journey, and where Chronosphere will be in the next year.

Featuring

Paige Cruz

Senior Developer Advocate
Chronosphere

Paige Cruz is a Senior Developer Advocate at Chronosphere passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to Site Reliability Engineering holding the pager for InVision, Lightstep, and Weedmaps. Off-the-clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.

Elliot Kipling

Director
Artifeks

Transcript

Ladies and gents, welcome back to another Enginears podcast. Today, I’m joined by Paige Cruz, who is a developer advocate at Chronosphere. They’re US-based, and they’re a company centered all around the lovely things of observability.

Paige is going to walk us through some really interesting case studies with some huge names over in the US – DoorDash being one of them. We’re also going to talk a little bit about her journey from software engineer through SRE and varying levels of that, into the dev advocate space and more engineering challenges as usual in between. So Paige, thanks for joining us.

Paige: I’m excited to chat [about] all things observability and what’s been going on in the industry lately … This is one of the true joys of my job as a developer advocate: Getting to talk to the broadest possible audience I can about why observability really matters to them at the end of the day – whether they’re a VP, director, manager, or an entry-level software engineer. Actually, observability is important for all of us.

Elliot: Yeah, it really is. We were talking offline 10 minutes ago about the need for observability with all of the chaos in 2023. I think observability, now more than ever, is so critical to businesses to know what their systems are doing, what their systems are saying, some of those systems being super critical, productivity of people, et cetera. The list goes on and on. So, it’s really key. Do you want to give Chronosphere an introduction and tell us all about the business and what they do?

Paige: Absolutely. Chronosphere was a startup born in the pandemic. We are just over three years old now. We were born from the experiences of our founders, Martin and Rob, working on the challenges of getting good observability at scale over at Uber, when Uber was reaching super high growth, back when the prices to call an Uber were super low. Times have changed. But their experiences working on M3, the metrics database over there that they’ve open-sourced, really showed them that this wasn’t a problem unique to Uber. This was actually a problem worth addressing for the industry at large. And so, they spun out of Uber and started Chronosphere. Within three short years, they’ve really tackled, first, metrics. Our philosophy is embracing open-source instrumentation. That is the way forward. The days of installing a proprietary agent, getting locked into one vendor’s specific ecosystem – those days are almost behind us.

We have to get a few more folks to migrate, but those days are numbered. So, they said: “What are the open-source standards when you look to metrics?” That’s Prometheus. Prometheus has been there helping us understand our Kubernetes clusters for years. And when it comes to the other signals, like traces, OpenTelemetry is finally putting its roots in the ground, and adoption is spreading pretty wide. And not only for traces. I think it can get pigeonholed as a niche tracing implementation. But actually, if you have not looked at OpenTelemetry in the last two years, we have added metrics, and logs are experimental and in progress. So, OpenTelemetry is really unfolding its umbrella to be a singular pipeline and delivery mechanism for all types of telemetry. And why is that so important? Why would you found a whole company around open source observability? Why not build your own agents?

Well, customers that are on OTel and Prometheus have control over their instrumentation and data. Having the ability, at least with the OpenTelemetry Collector, to process, add metadata, add filters, drop things, aggregate – to be able to do that in your network before sending it over to a vendor – helps you with costs, even compute efficiency.

The other benefit is more control. It’s your data. You should own it, for when a situation or a vendor doesn’t work out. Or perhaps you’re running your own DIY Prometheus installation, which is a totally great way to go when you’re starting your company, you’re starting out small and the data loads are manageable. Why not start with DIY Prom? But at a certain point, as your business scales, the more effort you’re spending in keeping your monitoring system alive, instead of putting those SREs or ops folks on helping your product become more performant and reliable, that’s a business disadvantage. At a certain point, it makes sense to say: “What would it look like if we sent this to a vendor and freed up more time?”

That’s really what drew me to Chronosphere. One, I didn’t want to work in proprietary instrumentation anymore. I see where the puck is going, I want to be there. And the second is that when we talk about efficiency and developer productivity, sometimes when I’m reading these articles online, it really sounds like you’re blaming the developer. I want to take a second because we have been shifting everything left, to developers who are supposed to be testing experts and performance engineers and SREs. And, they have to run and operate their own stuff so they’re maybe touching Terraform. We’ve put a lot on developers’ plates, and now the headlines are all about developer productivity. That is a lot of pressure to put on them. And for me, observability needs to be there to help you. It’s there to understand your systems. It’s there when you get paged and you get an alert. It means your alert has context. So your investigation is jump-started instead of figuring out: “Should this alert even fire? Is there actually a problem? Why did I wake up for this or leave my movie?” Or whatever it is.

I guess in a nutshell, I’m excited about Chronosphere. I joined, and I’m proud to be advocating for it, because open source instrumentation is best for the customers. It’s great for us as a vendor, but it’s really more about the customer experience. I have done some observability migrations, and they are my least favorite projects. They’re getting easier to do – the more and more tooling that is built. But, it’s a big lift for your company and you want to make the right choices. So, it’s about control and ownership of the data. And then for me, that transparency around how much it’s going to cost.

We had talked a little bit about this FinOps book I’m reading, which is a mashup of finance and DevOps, and it is drawing a lot of parallels for me with the cost of observability. They mentioned this term called a “spend-panic,” which is when a company has moved to the cloud, the bills were fine for a while. But at a certain point, enough new instances are getting spun up, or some data store for a developer environment has been running for three years and nobody even knows how to access it. And you have this moment as a leader where you say: “Whoa, the bill is too high and we’ve gotta bring it down.” That spend-panic moment is very much the same for observability. You can see a lot of headlines these days about eye popping observability bills.

Martin, Rob and the team at Chronosphere think that is not the way that it should be. And part of what they built is called the Control Plane, and it gives our customers the ability to filter things, drop them, and aggregate these metrics. Do you want us to pre-compute these really expensive queries? So, when you open a dashboard, it is snappy, because, if there’s anything a developer hates, it’s waiting for more than like three seconds. Not a patient bunch.

Elliot: There are some nuggets that I want to pick up on – the fact that, especially at a global scale of a business like Uber, you work on an open source project, or you build some tooling internally that you think will be valuable for the business. You open source it, you build a company around it. We featured about a year ago now, a team at Incident.io who did something similar at Monzo Bank. And it’s fascinating to see that project grow into a real business, because it is a real business need across other businesses regardless of your scale.

So, that’s a nugget in there that I find really, really fascinating. The FinOps part – we’ve spoken about this as well. Some costs are astronomical. I think to really gauge, especially what I’ve seen over the last couple of years running this podcast, observability is a topic that we talk more and more about in different ways. But, I think cost control and cost visibility is really important to businesses at varying sizes, especially if you’re a small startup and you’re thinking about how to get something off the ground, but keep our costs as low as possible. Just having that understanding, having that visibility of where your spend is, is key.

Paige: Absolutely. The thing that I think the philosophy of DevOps itself is always preaching is continuous improvement: the technical decisions you make today, the architecture that you choose today, that is fine. You’re solving the problems for the current context you’re in, in the moment. That does not mean that that decision should stand for the next three years. Just as people re-evaluated monoliths and moving to microservices, I think if you have not yet explored tracing, now is a great time to see when and how this could help your engineers. If your metrics or logging bills are just so high and your executives are in a spend-panic, what I see is people really quickly react and do whatever it takes to lower the bill. What I would love to see, and what Chronosphere helps you do, is understand what data is not valuable to you.

So, we make it really easy to find the low-hanging fruit of: “Are there metrics that nobody has manually queried for, that aren’t even charted on a dashboard, that aren’t a part of a monitor, that literally no one has been looking at?” That is a pretty safe thing to just cut. You can always add a metric back if you need it, but we’re helping you make more intelligent decisions about what data you need and your engineers rely on, and what you don’t – because our message isn’t: “Cut costs at the expense of everything,” but really, getting the most usage out of the data that you’re paying for. And so, one of our features, called Metrics Analyzer, I would have killed for a few years ago in a role where I had to trim the bill. I was like: “If I could go back in time and just have Chronosphere, this would’ve saved me months of handcrafted Google spreadsheets with awful formulas. I would’ve loved this.” And so, I’m really proud to represent all of the parts of our product that are there to help you. We don’t want to have an adversarial billing relationship. We want to help you, because we know if you’re happy with your data and you’re happy with the experience, you’ll stay with us.
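
The “low-hanging fruit” check Paige describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Chronosphere’s Metrics Analyzer or its API: the `MetricUsage` record, its field names and the sample numbers are all hypothetical, and a real system would pull this usage data from query logs, dashboards and monitor definitions.

```python
from dataclasses import dataclass


@dataclass
class MetricUsage:
    """Usage record for one metric (all fields are hypothetical)."""
    name: str
    datapoints_per_second: float
    manual_queries: int = 0    # ad hoc queries run by people
    dashboard_charts: int = 0  # charts that reference the metric
    monitors: int = 0          # monitors/alerts that reference the metric


def unused_metrics(usage):
    """Return metrics nobody queries, charts, or alerts on, highest volume first."""
    unused = [
        m for m in usage
        if m.manual_queries == 0 and m.dashboard_charts == 0 and m.monitors == 0
    ]
    return sorted(unused, key=lambda m: m.datapoints_per_second, reverse=True)


# Made-up sample data for illustration only.
usage = [
    MetricUsage("http_requests_total", 1200.0, manual_queries=40, dashboard_charts=6, monitors=2),
    MetricUsage("debug_cache_probe", 900.0),
    MetricUsage("legacy_queue_depth", 50.0),
    MetricUsage("checkout_latency", 300.0, monitors=1),
]

candidates = unused_metrics(usage)
print([m.name for m in candidates])
```

Sorting the candidates by data points per second puts the largest potential savings first, which mirrors the point that more data points per second means more cost.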

Elliot: Throw us a couple of engineering challenges that Chronosphere feel like they have to tackle on a daily basis.

Paige: There are a couple cases. One, which we have mentioned before, is where you are outgrowing your DIY observability stack. And it’s not just kind of small early startups. There are a lot of mid-size and larger companies that still do run their own stacks, and have whole teams that are staffed to keep that operationally alive. And for me, I always came from working at an observability vendor or using a vendor’s product. I hadn’t really seen or understood the challenges of DIY self-hosting until I got here. And I started to ask questions and we would get nuggets from engineers like: “Our monitoring system goes down frequently.” I said: “Excuse me, excuse me?” That is the one part where, no offense to the rest of your product, but if you can’t monitor it, that’s very terrifying as somebody holding the pager. You’re “flying blind.”

That’s the worst case, right? You’re in a really bad spot if you’re self-hosting, and things are going down. The other part is just the performance. There are some really awesome queries that are very expensive in time and computation to run. And what that translates to at the end of the day for engineers who are troubleshooting is a dashboard with just that infinite loading spinner. The rainbow beach ball, if you will. When I am responding to an alert, I don’t want to see a loading spinner. I want to see that data, I wanted to see that data yesterday. With the cognitive overload of responding to incidents in these complex systems, you are having to hold a lot of state in your head about which component is breaking, and what that means for your users, and some awareness of the infrastructure. Is this everybody, or a particular region? Maybe just one node. The longer it takes for your data to load, the more you’re having to really hold onto that information, not let any of it drop, just so you can correlate something that you saw from the alert or whatever it is. And so, when we talk about developer productivity, you have to have a responsive observability platform, because we need to move at the speed at which you think. We need to get out of your way as much as possible and just be helping you and giving you insights and guiding you along the way. So, that’s one challenge: for DIY, there are challenges as your business grows, both staffing and operating it, and then also just the user experience of it.

A second challenge that I see is for the many companies that are ready to move away from proprietary instrumentation. They want to embrace open source, they want to embrace OpenTelemetry and tracing, but they’re not quite sure how to navigate from where they are to that nirvana oasis, because it is a big lift to re-instrument. The reason some of these legacy incumbent vendors have so many customers on their roster is that, if you’ve installed their agents, it’s very hard to rip them out, because it’s a project where you have to ask every single team to do work. Or maybe you have a very good centralized SRE or central observability team that can help provide tooling, but at the end of the day, you’re touching almost every component in your system, from infra to application. And for those projects, I don’t know if you’ve ever tried to get more than like 50 people to do something, it’s a slog.

The approach that I’ve been taking now, and where Chronosphere really can come in to help is: Let us guide you. Let’s piecemeal, let’s break this up. Let’s instrument maybe your most critical services. We’ll re-instrument those, and bit by bit we’ll rip and replace as we go, but we’ll be here to help answer questions about how span sampling works if we want to do tail-based in the OpenTelemetry Collector. We can help guide you to beef up your open source knowledge, and you don’t have to feel alone. For people who want to embrace OpenTelemetry, I think there’s a little bit of fear [of]: “Has it been tested? How do I get help? There’s nowhere I can file a support ticket and expect an answer.” And so, it is a bit of a mindset and cultural shift to thinking about knowledge that you gain from Prometheus and OpenTelemetry as really durable. Engineers can take that from company to company since these are becoming the standards. It’s going to be easier to hire folks who know both of those already. I’ve had to onboard many different proprietary vendor tools.

They’re all radically different. They’ve all got different query languages, different interfaces and different instrumentation libraries. And that is honestly a lot for engineers to keep up with, and I think is a contributing factor to why we have so many observability knowledge gaps. Adopting open source observability is a journey, one that you’re committed to, but not something that needs to be done overnight – something that you’ll continue to see the benefits of as you roll it out and have a good partner by your side, like Chronosphere and our awesome customer success team or sales engineers. Knowing that you can partner with a vendor who will help you get the knowledge and skills you need eases that risk a bit.

Elliot: I can now start to see the vendor lock-in challenge: You explaining it, running me through some of those challenges, how it’s multi-team and how it isn’t just one vendor, but multiple. That’s the real challenge.

Paige: When you think about OpenTelemetry, a brief history: The last team I was on at my first company, New Relic, was building out their tracing product. I have been thinking about distributed tracing for seven, eight years now. I’ve forgotten a lot of stuff. I’ve learned a lot of stuff. But, I’ve spent more time than the average engineer thinking about how to make tracing work, what the challenges are. People ask a lot if tracing is dead, or: “If tracing has been around, why isn’t everyone using it?” And I’m like: “Well, it took us a long time to get one standard.” We had to merge two competing open source projects; that alone is a lot of work. We offered every major vendor a spot at the table to help us craft this future.

If you think about what we’re replacing, these incumbent legacy APM and monitoring vendors have had 10, 12, 15 years to develop their agents and build the infrastructure and tooling needed. We are replacing that writ large with a standard. And yes, that will take time, but if you just follow the OTel release notes, you’ll know that there is a constant steady drumbeat of improvements and new functionality being added. If you want to wait, see how things shake out, it’s trending very well for completeness and for bringing all the signals together. So, if you’re a little on the fence, check in with OTel in six months, make a note to keep checking in and just see where it’s progressed.

Elliot: We’ve spoken offline about a couple of case studies. One we’re going to focus on is DoorDash. There’s another one I really want to talk about, Robinhood. But let’s touch on the DoorDash one. So would you like to paint a picture for the audience to just understand the DoorDash case study and how you’ve helped them scale plus other things?

Paige: For anybody that doesn’t know, DoorDash is like the American version of Deliveroo. When you think about the core model there, you’ve got a user who’s ordering from a mobile device, food from a restaurant that they have not spoken to. And also, we are now dispatching somebody to go from point A to point B. A million things can go wrong along the way. Traffic could be affecting that wait time, maybe you accidentally dropped the soda and you had to go back. There’s a lot of things that can contribute to a poor customer experience. And the more data you have about that, and the stats, I think the better. And so that’s DoorDash in a nutshell. I love it when I’m on the road. I’m always DoorDash-ing to my hotel. Big fan.

How did they get connected with Chronosphere? Well, they were unfortunately experiencing a lot of data loss with their metrics. When we think about metrics, I think sometimes they get a bad rap as: “What are you going to do with the metric? It can’t tell you where problems are. It can’t tell you why something’s happening.” Well, metrics tell you that there’s a problem. I think that’s a pretty good checklist item. Metrics are what powers all of the alerts that are paging engineers all the time. When a number passes a threshold, that is a metric. I think we need to show metrics a little bit of love. And that kind of sets up why losing metrics or intermittently losing metrics is actually a big issue – because we do rely on them to be that pulse check on how our system’s doing.

DoorDash was using a bit of StatsD and one other solution to monitor their VMs. But as their technical system grew, as they added more engineers, as more services came online, their monitoring system wasn’t able to scale with them, and it kept breaking down. Specifically, their lead had said that they were experiencing constant packet loss, which are three terrifying words, all in that order. Unfortunately, their system was really brittle and developers could accidentally break the whole system by making some change that they thought was going to be totally fine. But one little bug would crash the whole system for everybody’s monitoring and metrics. As an engineer, I would be terrified to make changes to that. Even in a blameless culture, I don’t want to be the one that brought production monitoring down. As an SRE, I would be constantly worried when the next breakage was going to happen. If it is that brittle and easy for inadvertent accidents to happen, not malicious, I’m going to be on edge every time I’m on call for that team.

When I think about leadership, I could understand a leader being like: “Why can’t you keep the system up? It was working fine forever” if you don’t have that knowledge of how to scale systems and what makes that challenging. And so, what they found out is they had a big “noisy neighbor” problem. This was just not acceptable for them. They’re a very data-driven organization. They’ve actually got some very cool business statistics. I think they’re one of the companies, if you are using your observability and monitoring tool at its fullest potential, you do have business metrics in there and you have people that are outside of software, operations and product looking at it.

It’s not that they didn’t appreciate or value data, it’s just that the system they had, had reached its breaking point for where they were. When they were looking for a solution, they took care, because choosing an observability vendor or solution is a decision you want to be very confident in. It’s one that affects everything and everybody, people have to get trained and it’s not something you want to churn on constantly. So, they set out to say: “What are the criteria for our next vendor?” They wanted open source, they wanted obviously scalability since that was their primary challenge with their old system. In addition to scalability, reliability. So, the more data we throw at you, can you handle that? Will you be there for me when I get paged and I need to know what’s going on?

And then, they needed it to be fully distributed. I think if I flip that on its head, we do provide single tenancy. So when we’re talking about noisy neighbors or one metric’s change affecting other people, we give each of our customers their own tenant so no other customer can affect your experience on the Chronosphere platform. So I think that helps address that a bit. What happened after working with Chronosphere to move on to our Control Plane to understand the data and the metrics they were using? One quote is: “We don’t even discuss metric loss.” They went from constant packet loss to not even an agenda item. They got four nines of platform reliability from us. We do not promise four nines, I believe we promise three nines, but that’s what we have been delivering to them.
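
To put the gap between three nines and four nines in concrete terms, here is a small Python sketch of the arithmetic. It assumes a 30-day window, which is a choice made for this example only and not a description of how any vendor actually measures or defines its availability commitment.

```python
def allowed_downtime_minutes(nines, period_days=30):
    """Downtime budget, in minutes, for an availability of N nines.

    Three nines means 99.9% availability, i.e. 0.1% unavailability.
    The 30-day default period is an assumption for illustration.
    """
    unavailability = 10 ** (-nines)
    return unavailability * period_days * 24 * 60


for n in (3, 4):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per 30 days")
```

Over 30 days, three nines allows roughly 43 minutes of downtime, while four nines allows a little over 4 minutes.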

In addition to that, some other fun stats are that with our Control Plane, we give you that ability to filter, drop, aggregate, pre-compute. We really give you knobs to transform the telemetry data flow before it hits our storage. And once it hits our storage, that’s when we start charging. But, all the way up until then, you can be making these changes. Say you were sending local development metrics in and you were running a load test, and maybe you had a bunch of high cardinality data because you wanted very interesting load test results, and the observability team said: “Hey, you’re at your quota.” You could add a rule to say: “Let’s drop things with this tag: Paige’s load test.” When you’re affecting the data volume and the data cardinality that’s hitting our system, that is also saving you money at the end of the day.
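
The kind of drop rule described here can be illustrated with a short Python sketch. It is not Chronosphere’s Control Plane or its rule syntax: the rule format (a dictionary of label values that must all match), the `source: load_test` tag and the sample series are all hypothetical, made up to show how dropping tagged series before storage reduces what gets stored.

```python
def should_drop(labels, drop_rules):
    """True if the labels match every key/value pair of at least one rule."""
    return any(
        all(labels.get(key) == value for key, value in rule.items())
        for rule in drop_rules
    )


def apply_drop_rules(series, drop_rules):
    """Keep only the series that no drop rule matches."""
    return [s for s in series if not should_drop(s["labels"], drop_rules)]


# Hypothetical rule: drop anything tagged as coming from a load test.
drop_rules = [{"source": "load_test"}]

# Made-up incoming series; the per-request label is what drives cardinality up.
incoming = [
    {"name": "http_requests_total", "labels": {"service": "checkout", "env": "prod"}},
    {"name": "http_requests_total",
     "labels": {"service": "checkout", "source": "load_test", "request_id": "a1"}},
    {"name": "http_requests_total",
     "labels": {"service": "checkout", "source": "load_test", "request_id": "b2"}},
]

kept = apply_drop_rules(incoming, drop_rules)
print(len(incoming), "series in,", len(kept), "series stored")
```

Because the rule runs before storage, the high-cardinality load test series never count toward the stored volume, which is the cost lever being described.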

In DoorDash’s case over three years, they saved $42.5 million. That’s a lot of headcount. That’s a lot of burgers. I could live on that for a very long time. And that’s because they had this control over their data flow, to make sure they were getting the most value out of what they wanted to store.

Elliot: That’s a pretty incredible sum of money, and the irony around FinOps.

Paige: And the thing is, the functionality that we provide with our Metrics Analyzer is fairly new and novel. A lot of vendors make it a little bit complicated to figure out when, where and what’s causing the bill to spike. Before, it wasn’t in their best interest to make that data available. Sort of like how your cloud bill is like: “How many SKUs, how many different ways can you charge me for S3?” I think the answer is like 48. There are a lot of different combinations.

And so, not only did we make these tools accessible for people to modify and understand their data, we also made it easy to do. The interface is simple. We let you sort and filter by most and least valuable, and most data points per second – because obviously more data points per second, more data, more cost. For just a little mention of scale, DoorDash is sending about 840 million active time series. They have no shortage of data over there. When they say they’re data-driven, it is true. So it is not like we are cutting into the fidelity or ability for them to pull great insights, just by cutting this data that they didn’t need or rolling things up. Our whole idea is that you’re balancing that cost efficiency with the effectiveness of the data, and that those levers are for you to decide for your business context. But DoorDash had great success, and I’m a big fan.

Elliot: And I think with so many developer tooling products nowadays, developer plugins, customer experience, right? And you’re talking about dashboarding – integrating certain parts into what you want, see what data you want to see, the Control Plane dashboard. Customer experience is so, so key nowadays to getting something right in that regard. So, well done.

Paige: We have a few on our blog if you want to leaf through, because these savings don’t just have to be for super high scale [companies], or you don’t have to wait until you have constant packet loss to come call us. It’s in everybody’s best interest if you start proactively exploring your options, even if you’re not ready, and see what’s out there – do some napkin math to see what could be possible for your organization, because I honestly don’t know anyone whose company is, like, laughing, throwing cash in the air. Everyone’s pretty much tightening the belts these days.

Elliot: Well, we obviously know now, Chronosphere is a great company, great product, great mission, and where it’s come from. Let’s talk a little bit about your journey: software engineer, varying levels of SRE through to dev advocacy. Talk to us about that evolution.

Paige: I’ll say my relationship with technology has changed over the course of my career. I joined tech recruiting and a little bit of HR management running an intern program. And this was when everyone was parroting that software is eating the world. And I was like, you’re kind of right. I was using Google Docs in high school. I remember this software called Blackboard where some teachers would post homework, some teachers struggled to use it, which I think speaks to the challenges we have in technology – making it intuitive and accessible for everyone.

I thought: “Okay, there’s something to this technology thing. I’m working with these software interns.” I took a Python class in college. I studied engineering management. And so I was like: “I know what a for loop is. Maybe I could ride this software wave.” I looked around, saw a lot of other women who were really happy in their engineering roles. Being in HR, I knew what people made and I was like: “Hm, I could buy a lot of alpaca if I was a software engineer.” What really started it though, and the through line for my whole career, is the power of data, in context, to make points and effect change. And so in my case, I was doing this recruiting coordination role, and I said: “Look, our company is growing so fast. I am booking more and more interviews than I have before. I’m having to do this overtime. I love overtime pay, but don’t always want to have to do it. I think we need to hire someone else.” And they’re like: “Yeah, we hear you.” And then, I Googled online and found a Google Apps Script that would count calendar events matching a substring.

So, I copied and hacked away a little bit until I got it to count up the number of interviews, rolled up week over week, for probably the last two or three quarters, to show: “I’m not being dramatic. This is taking up a lot of my time and I would like to raise the flag now.” So I take my little chart into my HR meeting and I’m like: “Look at the data! This is wild.” And at the time, with ATS, or applicant tracking systems, we were getting our heads around data analytics, but it wasn’t like today where there’s dashboards and charts all over the recruiting pipeline. And so, I was able to provide this view that they weren’t able to pull from the ATS, and all of a sudden their eyebrows raised and they said: “Wow, no we’re not gonna hire somebody because we don’t have the budget. However, we really want you to make this report for the sales interview calendar and the marketing interview calendar. We want to see all this data together.”

And I thought: now this is interesting. What before was like a feeling-based conversation, I now have brought some facts. And while I didn’t get the outcome I wanted, I saw how relevant and interesting data was at providing better conversations, and people were asking more questions. We were able to share some of the load across all the coordinators, because we could see who was snowed under. And so that to me was like: “Okay, if I can figure out how to explore and handle any bit of data that’s available to me, it’s fun and I think this is a way that I can start to advocate for more changes.”

I went over to Hackbright, which was a three month software bootcamp, learned more Python, came out and was really not sure what to do. Even 5, 6, 7 years ago, getting an entry-level software job with a non-traditional background was very difficult. It is even more difficult today, but because I had worked for a software company, they did hire me back. My first team was very meta. We were building the demo for a monitoring company. So, to show off how good our monitoring platform was, we had to write really bad software that did awful things like N+1 queries and horrible timeouts. And I was starting my career writing bad code, on purpose. It was just very funny to me.

And from there I thought: “Well, this is interesting. I’ve learned a little bit about selling software, because for our production environment, if we were down, that meant any salesperson who was in a demo had a real awkward time and had to cover for us.” And when you are not the face, when you’re not the external person that’s talking to customers or prospects, you don’t feel that. You’re not the one whose face is out there having to apologize for reliability issues. And so, the more I talked to sales, I was like: “Oh, it’s really important what I’m doing.” However, at a certain point, we had finished a couple of big projects to make the demo environment spin up and spin down with Terraform. And so, the project was in a good place.

I said: “I want to try greenfield engineering.” I moved over to the team that was working on tracing, building out the new product. And I said: “Oh my God, greenfield is not for me. I am not the person you hire as your first engineer at a startup. I do not like the trade-offs you have to make in those situations. I like to keep things reliable. I want to work on stability. I want to have a longer-term vision that I’m working towards.”

And so, while I learned a lot about tracing, a lot about backend engineering, after a few months I said: “Okay, I’m going to try this SRE thing, because they’re the ones that are focused on reliability and stability.” And, I think that is where my interests lie. So, I popped over to a design company, did SRE, and that’s where I was introduced to Kubernetes. I found out from there, and a couple other places I practiced SRE, no one does it the same way. And even people on the same team can have wildly different philosophies and reasons that motivate them and approaches that they think the team should take, and even how to prioritize initiatives. And when you are not senior, and you haven’t seen a lot of projects and haven’t seen the cycles of business and the trends and seasonality, I found it to be a little bit taxing. Always being like: “But isn’t this SRE? Isn’t this how we should do it? I would like to try this.”

I was at B2B SaaS startups. When you’re an SRE at mid-size B2B SaaS startups, you really are working on infrastructure … You wear many hats. And I wanted to wear one hat. I should have probably gone to a big company with an established discipline. But, at the end of the day, all of that experience got me understanding what a SOC 2 audit looks like – reacting to spend-panics not only for the infrastructure cloud bill, but for the observability bill. It got me understanding when and how to address incident coordination problems and how to really be thoughtful when interviewing engineers about their on-call experience so that they open up, instead of seeing you as someone who’s just going to listen to them and never actually effect change. I’m very grateful for the companies I worked with, the teams that I worked on, and the mentors that still talk to me today and guide me.

But, unfortunately I was a victim of burnout, after a couple companies. No one had a great time during the pandemic, and especially not me. And, I took a sabbatical, and thought about: “What do I want to do? What are my skills? Where do my passions lie? And what is sustainable for me going forward?” which is what brought me to observability developer advocacy, where I get all the best parts of SRE teaching – going to conferences and sharing my advice and perspective and conference talks, getting to do webinars, coming on lovely podcasts. I get to share with the world why observability is important and I don’t just have to parrot Chronosphere. While I do believe Chronosphere is the best solution, I get to talk about OpenTelemetry. Most of the work that you’ll see come out of me is free, open source, applicable, no matter what backend or vendor you’re using.

For me, that is a lot more valuable to the industry than if I were just pushing out proprietary vendor stuff. And, I’m learning a lot about marketing. I promise you, we don’t sit and dream up new buzzwords. I think people think we’re just evil and think about how we get your phone number and spam call you like maybe other fields. Not us. We really just want to get the right message in front of the right person at the right time. I’ve seen a lot done and all of that led me to a very informed decision to come to work and put my face on Chronosphere’s product.

Elliot: There’s so many takeaways there. I think there’s bravery, there’s ingenuity, there’s intuitiveness to go and build that solution in your very first job doing a recruitment coordinator role: To then take on that step, continue going and be in the face of the unknown of not really knowing how to go and build something or write something – and look at where you are.

I think so many people will be able to look at this and hopefully take a lot of confidence and positives from this that anyone can do it. We do have a people shortage in tech, which obviously everyone’s thinking about very different ways to try and address, but it’s examples like this that are incredible that you should be extremely proud of. And hopefully people can draw from that and think: “Screw it. I’m going to do that, as well.”

Paige: I really think that especially in the observability space, we use these really off-putting academic and scary sounding terms like telemetry, observability, cardinality. What I try to tell people is like, I know this stuff. I’ve worked at multiple monitoring and observability companies. I was surrounded by the statisticians who did the math behind how to group alerts together in statistically meaningful ways. I don’t know that math. I know they know it, I know what I don’t know. But at the end of the day, at this point with as much content and tutorials out there, when you run into scary, unfamiliar academic sounding terms, you don’t have to go to dry textbooks. There are a ton of DevRels who try to break this down and make it approachable. And I think there’s never been a better time to start learning this space if it’s something that you really want to do.

Elliot: Lastly, before we leave: Chronosphere, OTel, where are they going to be in the next 6 to 12 months?

Paige: The hardest part of my job is that I get to see things that are in development that I’m not allowed to talk about publicly until we announce them. What I will say is, around KubeCon, November, we’ve got some good things brewing. Pay attention to Chronosphere on all of our social channels. I think we’re really continuing to invest in how we help you adopt open source. How do we help you enable your engineers to learn the skills they need to learn for instrumentation, analyzing time series charts, whatever it may be? We want to help you along every step of the way. Because we’re a young startup, we have a pretty rapid pace of development. We are not beholden to supporting a giant platform that does 20 different things. We’ve got a really good focus and we’re continuing to make that experience better. So, if you’re evaluating Chronosphere today, know that the goodness is only going to continue to come. That’s the advantage, I think, of going with a younger startup that’s had time to learn from cloud failures and observability mishaps and interesting pricing schemes. We were able to learn from all that and really provide a best-in-class experience.

Elliot: And this has been a class experience as well. This has been great learning more about Chronosphere. Hopefully we can accelerate some of that adoption as well, and people listen to this and think: “I’m going to use that.” It’s also been great listening and learning more about your journey, and how you’ve risen from role to role. It’s been great to have you on board.

I always say this, but I always want to check in with people, see what they’re up to. We’ll be checking in at some point, seeing how the business is growing. Please have a great time at KubeCon. It’s mid-September now, but for everyone listening, likes, shares, comments, they’re massively appreciated. You’re going to have all the info that you need in the description to keep on following. Paige, thank you so much.
