Addressing the buy vs. build dilemma of internal developer tools

Creating developer tools is essential to businesses. But how do you know which ones to create?

On: May 2, 2024

13 MINS READ

Building out internal developer platforms is a necessity for engineering departments, though it can be hard to determine when to build something internally and when to buy something off the shelf. The decision often comes down to a variety of factors, including engineering culture, budgets, and developer needs.

Chronosphere’s CTO and Co-founder Rob Skillington shared his thoughts on this question – as well as his experiences – with The New Stack at KubeCon North America 2023 in Chicago. During his appearance on The New Stack Makers podcast, Rob chatted about Chronosphere, the “buy vs. build” debate, and how M3 came to be at Uber.

If you don’t have time to listen to the whole podcast, be sure to check out the transcript below.

Is “buy vs. build” the right question?

Heather: Okay, great. Let’s get started. When people think of internal tooling, whether it’s monitoring or observability, continuous integration/continuous delivery (CI/CD), they often think of it as a build versus buy question. Is that the right question to ask?

Rob: I really think it’s not the right question. It’s the right question philosophically, but the way in which it’s asked is typically: Do you build or do you buy? It turns out that in practice, what you’re actually doing is much more complicated and nuanced, because you’re never purely just building or buying [internal tooling], unless you’re building everything from the ground up again, going back 50 years in time, writing your own operating system and creating your own chips.

You really are building on top of things that already exist, and it’s just really a matter of how much do you layer in of the build and the buy. I think that the thing that’s more important actually is first, before even asking that question, understanding what the different abstractions are.

And once you understand the abstractions of what you’re trying to do, then you can start to understand where the boundaries are in which you can build and buy [or] combine [both] build and buy into a solution that solves the whole problem.

Developing [where] the abstractions [are] is really important, because then you’re not just solving for today, right? You’re not only solving for the short term; you’re solving for the longer term, because you understand the entire landscape better.

Once you’ve understood that landscape, you can start to carve out individual components and ask, “Could I build or buy here, and have I drawn the boundary at the right level?” [Then] if I do want to ever revisit this decision, can I do so without having to change the interface between the abstractions?

If the protocol still keeps working, well, in the observability world, can I bring my dashboards from one system to another? Then, once you’ve been able to do those first two tasks (defining the abstraction layers and analyzing each one), the third task is understanding when to revisit, for each one of those, the decision I’m making today around building or buying in that individual area. That’s what I’ve found is the most important relationship to understand going into this process of building versus buying to solve a problem that you have.
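To make the idea of a stable boundary concrete, here is a minimal Python sketch. It is an illustration only, not anything described in the podcast: `MetricsBackend`, `InHouseBackend`, `VendorBackend`, and `record_and_average` are hypothetical names, and the "vendor" class is an in-memory stand-in rather than a real client. The point is that the calling code depends only on the contract, so the build-or-buy decision behind it can be revisited without changing callers.

```python
from typing import Protocol


class MetricsBackend(Protocol):
    """The contract at the boundary; callers depend only on this."""

    def write(self, name: str, value: float) -> None: ...

    def query(self, name: str) -> list[float]: ...


class InHouseBackend:
    """Hypothetical 'build' option: keeps samples in a dict."""

    def __init__(self) -> None:
        self._samples: dict[str, list[float]] = {}

    def write(self, name: str, value: float) -> None:
        self._samples.setdefault(name, []).append(value)

    def query(self, name: str) -> list[float]:
        return list(self._samples.get(name, []))


class VendorBackend:
    """Hypothetical 'buy' option: an in-memory stand-in for a vendor client."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, float]] = []

    def write(self, name: str, value: float) -> None:
        self._rows.append((name, value))

    def query(self, name: str) -> list[float]:
        return [value for row_name, value in self._rows if row_name == name]


def record_and_average(backend: MetricsBackend, name: str, values: list[float]) -> float:
    """Caller code: works with any backend that honors the contract."""
    for value in values:
        backend.write(name, value)
    samples = backend.query(name)
    if not samples:
        raise ValueError(f"no samples recorded for {name!r}")
    return sum(samples) / len(samples)
```

Swapping `InHouseBackend()` for `VendorBackend()` in a call to `record_and_average` requires no change to the caller, which is the property the boundary is meant to preserve.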

Heather: So thinking about the long term, not just solving the problem right now, but down the road when the technology changes, is this gonna be flexible enough? Is this going to be adaptable enough?

Rob: Yeah. And I would say that people don’t 100% ignore it necessarily, right? They do future-proof their plans, but I think the different boundaries and the contracts between those boundaries are often not analyzed as deeply. People just simply think, “as long as I choose an open standard vendor or tool or platform, then it doesn’t really matter. I’m not locked in.” But it turns out that, even though that’s true, there are incompatibilities between visualization elements.

If you build it, they might not come

Heather: When it comes to shipping good tooling, is it true or false that “if you build it, they will come”?

Rob: I think the answer to that question is “not always.” I’ve experienced that firsthand a few times. I can tell you about some of those experiences. The major takeaway for me is kind of understanding when to reanalyze and reassess that first [version] of the layers and the abstractions that you chose. It’s really important to understand when you should revisit those criteria of why you chose something.

Here’s what we found when they didn’t come for [a tool]: we were building a visualization platform internally at Uber at one point, and it was not as popular as the de facto tool that we had. We found out only 50 or 60 users were even using the system daily, whereas the de facto tool [had] 1,000 engineers per day using it. So we should have really understood that as a KPI going into the project, and I think it’s really important to do a bit of that thinking upfront.
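The adoption KPI Rob mentions can be written down as simple arithmetic. The sketch below is only an illustration: the function names and the 0.5 target are assumptions, not figures from the podcast. Only the daily-user counts (roughly 60 versus 1,000) come from the conversation.

```python
def adoption_ratio(new_tool_daily_users: int, incumbent_daily_users: int) -> float:
    """Daily users of the new tool as a fraction of the incumbent tool's daily users."""
    if incumbent_daily_users <= 0:
        raise ValueError("incumbent_daily_users must be positive")
    return new_tool_daily_users / incumbent_daily_users


def meets_adoption_target(ratio: float, target: float = 0.5) -> bool:
    """Check the ratio against a target; the 0.5 default is a made-up threshold."""
    return ratio >= target


# Upper end of the "50 or 60" daily users, against 1,000 on the de facto tool.
ratio = adoption_ratio(60, 1000)
print(f"adoption ratio: {ratio:.0%}, meets target: {meets_adoption_target(ratio)}")
```

Agreeing on a number like this before the project starts gives the team a concrete trigger for revisiting the decision.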

Heather: So [you should] know your audience and what they want first.

Rob: Yeah. Know your audience and how you grade the criteria for the tool you’re buying or building, and use that knowledge of your audience. What is the most important thing for this problem you’re solving? It’s tracking and understanding whether the tool is meeting its criteria or not.

Addressing “not invented here” syndrome

Heather: Tell me about the “not invented here” syndrome. How can you tell if that’s present in an engineering organization?

Rob: That’s a really tough question because the answer is very vague.

Heather: Unfortunately, we love vague answers on this podcast.

Rob: I’m just kidding. I feel like someone could do a master class on [this question]. From my time [at Microsoft], that was a company very heavy on the spectrum towards a “not invented here” kind of culture. I would say at a certain point, I couldn’t really fault the company too much because I understood that Apple had their own ecosystem, right? Microsoft has theirs, and there’s really a design that bled into the system where the more things you use in the Microsoft universe, the better the experience gets. That is part of the reason why Microsoft was perhaps heavy on the spectrum: it actually improved the product, I suppose.

But I do think it was really difficult to work inside Microsoft because sometimes they were building things that I don’t think necessarily needed to be rebuilt for the sake of rebuilding. To more fully answer your question, I think it shows up in conversations really easily. When a set of engineers is around a table, or on Zoom or some other conferencing software these days, you can see where the center of gravity is. When people start to propose ways to solve problems, you can see whether they gravitate towards “well, should we check out how other people do it?” or “should we just jump to a solution that’s likely an internally built thing immediately?”

So I think that’s a good signal. Then I would say the other thing is just looking around at the history of the company, as those decisions are well documented. Uber was much more rational because when I had come, it was a much younger company trying to achieve a goal in terms of becoming a profitable business, and you needed to get to where you were trying to go quite quickly. The faster the business moved, the more likely we could fulfill our goal of infrastructure as reliable as running water. Even though it was more rational, there definitely were pockets of “not invented here” things, because once an organization has solved one or two things that definitely did need to come in house, it starts to pervade a bit more of the culture, for those teams and the sister teams nearby. That’s a tough thing to spot. But there are multiple indicators.

The M3 and Uber story

Heather: What are some examples of build versus buy decisions that you were part of? I think one I’ve heard about is the M3 story at Uber.

Rob: M3 was really born out of the fact that we had a really complex distributed system that was containerized and in a cloud native architecture where people hadn’t really brought 4,000 different microservices along to a cloud native workload before that quickly. And so the system was very complex at high scale and had transitioned from a virtual machine (VM) based world to containerization as well. So you had a lot of these compounding factors in terms of how difficult it was to really understand that stack, and it required a lot of raw telemetry data. So M3 was born to really cater to the fact that we needed levels of scale that weren’t previously necessary.

To revisit the journey a little bit: M3 and observability tools at Uber actually started out as platforms and solutions that we bought and vendors operated for us. M3 started to replace a sliver of that and then expanded to a much larger footprint as more parts of the Uber stack got more complicated and required in-house solutions. The other thing, though, and the reason Chronosphere exists, is that it wasn’t just the raw telemetry data. It was the way in which you interacted with the data, because you need a view into this system [so that it’s] understandable if you’re gonna be pivoting around 4,000 microservices.

In fact, I remember the tracing diagram. It would completely blow your mind, and everyone else’s. So it was really about getting in and orienting you around the part of the system that was experiencing problems, and having a lot of jumping-off points to get you to that place quickly, effectively, and reliably, regardless of your experience as an engineer or your tenure in the company. M3 at first was on Cassandra and Elasticsearch as storage solutions. It was really a thin layer, and we had an abstraction of Graphite in the early days.

And that’s why I think the abstraction layers are really important here, because it actually did work out well for us, not reinventing the telemetry type. We could focus on what was important to the business, which was the scale and how we interacted with the data. We did then, of course, move to offer a Prometheus-style metrics interface to people. Then M3 was able to implement both of those abstractions; there was one cohesive unit to run both of them, but a lot of that was bought.
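As a rough illustration of one unit serving two metric abstractions, the Python sketch below stores every series as a set of labels and adapts Graphite-style dotted paths into that same form. It is a toy under stated assumptions, not a description of how M3 works: `SeriesStore`, `graphite_path_to_labels`, and the `part0`, `part1`, ... label names are all hypothetical. The only facts relied on are that Graphite identifies a series by a dotted path and Prometheus by a name plus key/value labels.

```python
from collections import defaultdict


class SeriesStore:
    """Toy in-memory store: each series is identified by a set of labels."""

    def __init__(self) -> None:
        self._series: dict[frozenset, list[float]] = defaultdict(list)

    def append(self, labels: dict[str, str], value: float) -> None:
        self._series[frozenset(labels.items())].append(value)

    def select(self, matchers: dict[str, str]) -> list[float]:
        """Return samples from every series whose labels include all matchers."""
        wanted = set(matchers.items())
        samples: list[float] = []
        for series_labels, values in self._series.items():
            if wanted <= series_labels:
                samples.extend(values)
        return samples


def graphite_path_to_labels(path: str) -> dict[str, str]:
    """Hypothetical adapter: map each dotted path segment to a positional label."""
    return {f"part{i}": part for i, part in enumerate(path.split("."))}


store = SeriesStore()
# Graphite-style write: a dotted path adapted into labels.
store.append(graphite_path_to_labels("api.requests.count"), 3.0)
# Prometheus-style write: a metric name plus key/value labels.
store.append({"__name__": "requests_total", "service": "api"}, 5.0)
```

Both write paths land in the same store, so scale work on the storage layer benefits either interface.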

Cassandra and Elasticsearch were not vanilla off-the-shelf solutions. We had help from DataStax as well with Cassandra. It was only later, when we needed to completely leapfrog yet again to the next level of scale and complexity, that we took that storage layer out and wrote the time series database and the aggregator completely from scratch.

Heather: Did you have success in getting people to use it?

Rob: Yes; that’s a success story. The one we talked about before was not a success story. But M3 was massively utilized: more than 1,000 daily users were interacting with it through the dashboarding, and we had more than a few hundred thousand alerts on there. It was really interesting seeing how usage dropped off, such as the hours people would log in and spend interacting with the data; it mimicked where we worked around the world. More than half of our engineering force was using it daily, because we only had 2,000 engineers at the company at that time. It goes to show that observability is fundamentally a necessity in every software engineer’s life [if they] work on any backend infrastructure.

Because without it, all the digital infrastructure would just collapse. We wouldn’t have enough capabilities to operate these things at any level of complexity. I actually think that observability is helping us build bigger and better digital skyscrapers, and it shows in the user data how important it is for everyone to do their job effectively at scale and at rapid speeds of development.

Empowering developers with Chronosphere Lens

Heather: You had an announcement at KubeCon. Do you wanna tell me a little bit about that?

Rob: Of course. So Chronosphere Lens, which I’m really excited about personally, is really the first major footprint in terms of our expansion into solving the complexity of how you interact with this cloud native data. As we chatted about, it’s not just the raw scale and aggregation; a lot of people obviously know about Chronosphere for our control plane and being able to fit many more observability use cases into far fewer dollars for you. It’s also about giving you that view into the system that gets your job done much quicker. So that helps you remediate issues much faster. It helps you orient yourself. It helps you onboard engineers. It helps you be more competitive in your own core business because you can develop and respond to market shifts and dynamics quicker because your observability is much more useful. It helps you get there quicker and more reliably without having hours of downtime.

Chronosphere Lens is the developer experience behind that, and the spiritual successor to what we built internally at Uber to solve that problem. It’s taken us a few years to get there, but I’m really excited about how we wrapped it all up in a bow. Hopefully people unwrapping it agree with us that it solves a lot of challenges that haven’t typically been solved before.

Heather: Terrific. Yeah. It does seem like developer experiences are at the center of what a lot of people are doing right now. They’re kind of the key to everything, right?

Rob: Yes. Especially in terms of how effective a single engineer can be at your company, I think it is really important, because it gets to that aspect of how competitive you can be in your own core business: you can’t compete if you are not able to be the most productive version of yourself as an engineer.

Heather: That’s probably a good place for us to wrap up. I just want to thank you for joining us again, Rob. It’s been a pleasure.
