As cloud native environments evolve and increase in scale, developers need the right tools to help them oversee all the necessary components and sift through telemetry data. But figuring out which tools both benefit developers and drive business needs forward can take time, and some trial and error.
Last month, Chronosphere CTO Rob Skillington sat down with the Stack Overflow podcast to talk about his path into the observability sector, learnings from building internal projects at Uber, thoughts on the “buy vs. build” debate, and projections for observability tooling.
If you don’t have time to listen to the full podcast, read on below.
A journey into observability
Ben Popper: We have a couple of great questions prepared for you, but first let’s set the stage a little bit. Tell folks just quickly, how did you get into the world of software and technology? What was the path from working sort of in the field you’re in to founding and being tech lead at your own startup?
Rob Skillington: It was an interesting journey. I came over from Australia in my first year out of college. Before that, I’d interned at Microsoft in Seattle for three months, following a colleague from Melbourne University. We were both mentoring and teaching a class together, and he told me he had just got back from doing a full winter in Seattle working for Microsoft. As an Australian, you kind of ask yourself the question: “Oh, that’s a possibility?” So that’s how I got in touch and connected with Microsoft, and went there to work as an engineer on Office 365, which was then called Microsoft Online as they were, you know, building it and quickly launching to compete against Google Docs.
I’ve kind of always been fascinated with infrastructure and software, even from high school, but that was my introduction to the real world of it. I did some time there and a few startups after that.
I went to a smaller place, but coming back to [a level of] engineering rigor and professionalism was what I was very hungry for. It turns out that companies like Uber have an incredibly high need for reliability, because hundreds of people can’t work when the software is not running.
That’s kind of how I ended up at Uber, solving some of the challenging infrastructure problems there. But I quickly moved to be one of the first few members of the New York team, tackling the metrics infrastructure projects. And that’s how I got to where we are today.
The evolution of the M3 database
Ben: We did a previous podcast, Five nines uptime without developer burnout, with Martin Mao, your CEO and co-founder, just talking about how you diagnose errors in production, especially during a crisis and as systems get more complex. I know you two were at Uber together where you developed a lot of these ideas, as you said, dealing with so much data and wanting to be, you know, able to jump on a problem immediately and diagnose it and make sure there was as little downtime as possible.
I want to get into some of the issues we were going to talk about today: When do you build versus buy this [type of tool]? How do you convince people to adopt it?
But before we do, just one last thing: Chronosphere is a company that kind of came out of that experience the two of you had at Uber before you left. Can you tell folks what M3 is, just so we’re sort of setting the stage here?
Rob: Yes, that’s it. It started as open source. Well, actually, funnily enough, it started out as an internal code name for a project that was not necessarily intended to be open source: the metrics infrastructure project that was taking over from Graphite at Uber. It kind of started off as a Cassandra and Elasticsearch system.
M3, as most of the world knows it, is an open source project we started in 2016 under the same name, to rebuild the system in a fundamentally leapfrogging style, where we would optimize to the absolute last possible step we could to get the best unit cost effectiveness. That gave developers back the freedom to liberally instrument their applications and services to help them scale the business, because there really were so many unknown unknowns they faced in production. Having all of that extra observability data really was core to how products got built at Uber without risking the entire thing sliding sideways when it went to production.
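For readers less familiar with metrics instrumentation, here is a minimal sketch of what “instrumenting liberally” can look like in practice. It assumes a Prometheus-style Go client (the interface M3 eventually standardized on, as discussed below); the metric name, label, and handler are purely illustrative, not Uber’s actual code.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter tracking processed requests, labeled by outcome. Cheap to add,
// which is the point: developers are free to instrument whatever they need.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ride_requests_total",
		Help: "Total ride requests processed, by outcome.",
	},
	[]string{"outcome"},
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
	// ... business logic would go here ...
	requestsTotal.WithLabelValues("success").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/request", handleRequest)
	// Expose all registered metrics for a centralized backend to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```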
Managing microservices at scale
Ben: So I’ve heard Uber has, or had, and may still have, thousands of microservices. How did you go about creating this project and planning for it?
Rob: So Uber had 4,000 microservices at its height. I actually believe that over time they’ve been slowly trimming it back a little bit. At one point, there were two microservices for every developer, which was rather insane, so there has been a little bit more of a reconsolidation.
What’s kind of fascinating is that at the very beginning, there were two monoliths. There was a Python backend that had all the trip stores. Then there was a Node.js real time trip service, essentially. It really went from one extreme end of the curve to the other. And now they’re kind of balancing out at a healthier medium between the two.
But in terms of rolling out infrastructure to such a diverse set of applications and systems, we had to really be thoughtful about the abstraction level at which you want to interact with the systems, because there was also a variety of very different client languages. We were one of the few companies, I think, at the time that had just said, “you know, let developers do what they want, like, that’s how we will move and execute faster.” However, that came at the huge cost of supporting Python, Node.js, Go, Java. The level of thought about what kind of abstraction you want to offer these systems and developers was very important.
So that’s why, from day one, we were very keen on developing using open source standards where there would already be existing client libraries, for instance, for observability and for networking. That’s kind of an interesting story, because Uber has eventually moved towards gRPC, but did famously also invent its own RPC framework, like Twitter did with Finagle.
So that kind of got standardized and consolidated. And honestly, the same kind of thing happened with M3, in that the interface we wanted to offer people became Prometheus over time, right? It actually started out as Graphite. And we housed both Graphite and Prometheus data within the same backend, which was kind of different to how the systems of the day were doing things. Most of the time, they were essentially just offering one type of language and backend, whereas we built the abstraction layer so that we could layer the different languages and ecosystems on top of it, both on the ingestion side and on the query and alerting side. That, I think, was critical to making sure people could onboard easily and get going.
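As an illustration of the kind of abstraction layer Rob describes, here is a hypothetical sketch (not M3’s actual ingestion code) of how dotted Graphite paths and labeled Prometheus samples might be normalized into one internal representation before reaching a shared storage backend. All type and function names here are invented for the example.

```go
package ingest

import (
	"strconv"
	"strings"
)

// Sample is a hypothetical common representation that both ingestion
// paths normalize into before reaching the shared storage layer.
type Sample struct {
	Name        string
	Labels      map[string]string
	Value       float64
	TimestampMs int64
}

// FromGraphite maps a dotted path like "servers.us-east-1.api.requests"
// onto the label-based model by assigning positional label keys, so
// Graphite-style queries can still address each path component.
func FromGraphite(path string, value float64, tsMs int64) Sample {
	parts := strings.Split(path, ".")
	labels := make(map[string]string, len(parts))
	for i, part := range parts {
		labels["g"+strconv.Itoa(i)] = part
	}
	return Sample{Name: path, Labels: labels, Value: value, TimestampMs: tsMs}
}

// FromPrometheus passes already label-based samples through unchanged.
func FromPrometheus(name string, labels map[string]string, value float64, tsMs int64) Sample {
	return Sample{Name: name, Labels: labels, Value: value, TimestampMs: tsMs}
}
```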
I will say that fundamentally, from day one, we were very opinionated that it needed to be a multi-tenant system where you could just add machines horizontally. Whereas, like, back in the day, you know, Borgmon at Google developed this kind of pattern. I know Borgmon’s kind of the previous generation these days, but a lot of people have come from that kind of age where most teams ran their own infrastructure for monitoring rather than using a centralized service. I think that was another key decision to what made it quite popular: people didn’t run into weird, strange edge cases in their own instance of the observability stack, because the observability stack was centralized and run for them in one very accessible layer.
Build, buy, or something else? The debate with tooling
Ben: Your time at Uber spun out into Chronosphere, which is a company on its own that’s thriving, plus M3. [If] I’m starting my own startup and I say, “you know, we’re going to build something internally. It’s going to become the next Golang, or React, or M3. This is the way to do it. All the geniuses here will get together. We’ll build this for our own problems and then it’ll create an ecosystem around us. We’ll attract the best talent.” As companies are having that discussion internally (build versus buy), is that even the right question to ask? How should companies frame these questions about observability and internal platforms?
Rob: I wanna start off by saying I don’t believe it is the right question, and it absolutely needs to be framed better. When the question does get raised internally, at the end of the day you’re always buying a little and building a little; there’s really almost no world in which you’re doing solely one or the other. The big thing for me, and what’s interesting about even a project like M3, is that in the early days, since there was nothing special the company needed from the monitoring and observability system, off the shelf made total sense.
So it’s really a journey that you go on as the requirements shift and change in nature. And as we just chatted about, the versions of M3 that were offered internally changed fundamentally as we realized, okay, off the shelf didn’t work in these certain areas, so we’ll swap out certain parts of it and build on them internally.
It really is a journey and a selection of what you’re building and what part of it you can buy. And honestly, you know, a lot of the time, unless it’s core to your business, it doesn’t really always make a lot of sense to invest a whole bunch of time there.
I think that at Uber, even if there were plenty of things built in house, there was at least a default of: If it didn’t matter to the core business, you should always assess what parts of your problem you could buy instead of build to move the business faster.
I believe you should frame these conversations in terms of what your organization actually needs longer term in a specific area of concern, rather than just asking, “what do you need right now?” That way you don’t accidentally make short term decisions that impact longer term strategy. And then: what lines of abstraction can you draw to make layers for the major pieces of what you do need, in such a way that between each of these boundaries something can be built or bought, instead of you having to wholesale build or wholesale buy?
Then, for each of those layers that you define: which one does it make sense to build or buy? And at what triggering points should you revisit that decision? For instance, what are the major benefits today of buying or building? Is there an associated cost that you’re trying to calculate there? And when does that cost trade off against how many people you apply to the problem? Because sometimes it can seem a lot cheaper to do it internally. But then over time, if your team grows to 15 people, well, that’s a massive cost to the business, right? So I think those are the three key ways in which I would approach some of these decisions.
The culture around observability tool adoption
Ben: Do you think it’s easier to get internal adoption on internally built products? Is there sort of an internal pride associated with the “not invented here” attitude?
Rob: That question is really interesting. I think that there’s naturally a tendency for folks, especially earlier on in their career, [to think] that it can be done better. A lot of the old tools don’t quite understand the current ecosystem, and therefore it could be done better. I think that there is definitely a level of discovery that is really important to go on during the journey of assessing a tool: whether it should be built internally or built on top of something purchased, right? That journey is something that you learn over time.
When we started M3, there was a 40-page document put together on the current state of the world. It was framed as: “What are the problems we have today? Why did we arrive at the problems that we have today? Is it simply the way that we’re doing things that could be changed to fix some of these problems?”
Then [we had] a thorough assessment of what the options were on the table, and the kind of prior history and art out there in the ecosystem. We actually had a bunch of talks with other companies of similar sizes at the time, to understand what they were doing and the journey that they’d gone on in this space as well. You know, M3 is quite a large project, so obviously that level is perhaps overkill for a bunch of projects. But I do think that some degree of doing that, just the process of undergoing it and holding yourself accountable and being diligent on that journey, is important.
You can build it, but they might not come
Ben: What are some examples of sort of “build vs. buy” decisions that you went into, you know, you thought, okay, we made the right decisions or the wrong ones, you know, what were the outcomes? And what do you think folks can learn from that?
I think probably we’ve discussed M3 and Uber was a big one, but maybe dig a little deeper into that one. If there’s others you want to discuss, we’d love to hear about it.
Rob: There are actually tons of examples of, “if you build it, they might not come.” I was responsible for one of those projects before M3. It was basically a system that optimized a lot of the mobile interactions with the dispatch trip systems, doing real time synchronization of some of the data in there. That was kind of overkill at the time, given the stage of the product that was being built on top of it.
It was set up to support Uber Eats and then it was naturally going to take over the dispatching systems. That synchronization technology, though, fascinatingly enough, is core to the success of another [project]. I mean, it’s been rewritten and completely reimagined. The engineer I was working with on that problem has started a project management company called Linear.
I think that’s a project that, to me, showed that even though it was probably the right long term choice for that problem, it was definitely not the right time or place to build it. People did not come because they had 99 other problems, and integrating onto this new framework was, you know, not the top one of them, right? So I think that it’s really important to do that kind of consensus work, even if you’ve kind of already carved out a whole area that you know you need to improve anyway.
The other one that’s kind of interesting to chat about is a visualization tool that we put into place that sat alongside Grafana, to show a more consistent set of observability visualizations and insights into the system, so that an engineer who was just hired into the company could actually orient themselves rather than having to wade through 10,000 Grafana dashboards, which is a real number of dashboards that Uber had.
I think 8,000 of them were probably not used in the [previous] year. But it was really a difficult thing to understand the complexity of the system, especially with 4,000 microservices, right? So there was this idea that we could present a more consistent view and let people navigate between the systems and services themselves in a fashion that kind of felt familiar to people. Then also it showed the dependencies upstream, downstream, [and] whether those systems were experiencing problems. Because when you get paged on call, the last thing you want to do is essentially work out whether the errors are in a downstream system and you’re just being woken up for someone else’s problem.
Anyway, this project was built, and unfortunately, you know, we started tracking the internal usage of it, and it did not add up to anywhere near the level of Grafana. Grafana had 1,000+ daily unique users out of the 2,000-strong engineering force, which I think kind of speaks to how important observability is to people’s day jobs in software engineering.
But this system that was developed to give a more consistent view into things, while it had some passionate users, still only had 50 to 60 users logging in a day. It would probably have been better to build one of those experiences inside of Grafana itself, or to approach the problem in a different manner, right? Because we built it, and they did not come.
Bringing in and developing platform engineering
Ben: We’ve talked about platform engineering with folks before. Obviously, it takes a lot of back end engineering, distributed systems, infrastructure stuff, and consensus building. What are the sort of other hats and skills that you use when you’re building a platform?
Rob: I think that the most interesting thing to keep in mind is really to understand what the challenges of the business are. I found through my own journey that working on the projects that were put in front of me was just fine, but if you didn’t fully understand what was most important to the business, then a lot of the time you really went off on a tangent from what was actually delivering value to folks.
So that was a learning just in terms of career, but then also in terms of the projects and how to approach the projects, right? I think that at Uber with M3, that really helped guide what parts of the observability system to focus on at any point in time, right?
The storage side of it, building M3 the time series database, was both an academically enjoyable aspect and also a nightmare, in terms of building a storage system that your whole company has to depend on being up every minute of the day. It was definitely one of those things that happened much later in the game, right?
There were tons of other things that mattered more to the business than the unit cost of the observability system until the unit cost of the observability system was, you know, one tenth of the hundred thousand servers that Uber had, and then you had to be like, “this is actually the biggest problem in observability for the company at that point.”
I think it’s about navigating that really critical path of “what is the special sauce?” and “where is the thing that’s going to be the most impactful from the thing you’re working on?”, and about how to show people that and guide them towards it. Because a lot of the time, right, folks come from different companies, and they have different opinions about their own tools and stacks and infrastructure.
If you can help them see what the most important aspect is for your company, your project, your team and organization, and why that might need to be done slightly differently to the universe they’re used to, I think that’s really important to focus on.
Observability tooling and adoption trends
Ben: What was really interesting about this [conversation] was you identified some of the pain points that you had at Uber that led to this being a successful build project. What are the things that you think are going to be interesting in terms of observability challenges or opportunities in the next year or two?
Rob: There’s a lot happening out there right now. I feel the common theme that keeps occurring is: you want the infrastructure, developer tools, and entire ecosystem to kind of just get out of your way and let you do magical things.
GitHub Copilot is obviously top of mind for folks, helping you write your own code with artificial intelligence (AI). Of course, there’s plenty of other tools out there that just kind of streamline the whole process, right? That’s why continuous integration/continuous delivery (CI/CD) has been such a large focus over the last 10 years.
I would say that the big thing that I’m seeing in observability especially is that you can’t really spend so much time self-managing these systems anymore. For instance, when we had more static servers at Uber, we [would] literally use [an] intelligent platform management interface (IPMI) to go into servers to reboot them and stuff like that on bare metal. You could really treat them more like pets, right? You could put in a little bit of extra work and get some nice things back in terms of really customizing your setup.
But nowadays, it’s kind of turned to more of a place where the less customization you do by hand, the better, because that streamlines the whole process and you can move between clouds and different frameworks and types of things quicker, because there’s more standardization in place.
In the observability world, you’re also managing all this data, right? The data has never been more significant in volume, or more complex to query. That kind of management has exploded, and that’s caused people to slow down a ton.
What you want to get to is this nirvana where people aren’t on-call being paged all the time. It’s a really big problem. Say you have a single developer spending hours each week answering 20 high-urgency pages. That’s one of maybe four people on the team who doesn’t get work done that week, right?
And I think about the ability of these observability systems to remove certain levels of complexity by themselves, just by managing themselves better. Chronosphere is built on that aspect: it has to be more automated and save you costs, so that you’re not managing the explosion of data. But [you also want to] make room for the observability teams of these companies, or the SREs, to essentially make on-call healthy, and that is really what we’re trying to do here. We’re trying to make people productive, not distract them with random things that completely drag on the entire team’s productivity.
I think you’ve seen that Cribl is obviously a big new entrant, helping people manage their Splunk and similar workloads. There’s a reason why all these companies, and Chronosphere, for instance, have grown so quickly in the past few years. It’s because this is actually a game changer in terms of self-managing the data and making a healthy on-call state much more achievable.
It kind of feels like a cost optimization to some people, but it’s much more than that. It’s really about: this high-volume telemetry data isn’t even useful in its current form, so let’s reshape it and help you manage it more automatically, to get out of your way [so] you can get back to your day job and get it done quickly.
Learn more about getting started with cloud native architecture.