Why does it take higher levels of observability to manage financial operations (FinOps) in complex IT environments? What is behind the growing interest in FinOps?

In a September 2023 episode of Techstrong.tv, Martin Mao, CEO of Chronosphere, sits down with Techstrong Group’s Michael Vizard to chat about the relationship between FinOps and observability, the problem of cost transparency in a cloud native world, and advice for those looking to implement observability.

If you don’t have time to sit down for the video, you can read a transcript of the chat below. 

FinOps: The talk of the town

Michael: We’re here with Martin Mao, who is CEO of Chronosphere. We’re talking about FinOps and observability, and how the two don’t go hand in hand as often as they should. Martin, welcome to the show.

Martin: Thank you, Mike, for having me here. 

Michael: We’re hearing a lot of folks talk about FinOps these days, and there’s clearly a lot of interest, especially as the economic headwinds get a little tougher for certain vertical industries. It seems to me there’s a lot of folks who just simply forgot how to do FinOps. We used to do capacity planning all the time, and now we’re kind of struggling. Every survey that I see suggests that there’s a lot of interest, but not a lot of know-how about what’s going on. 

Martin: My personal view is that over the last couple of years, in a zero-interest environment, the focus on cost efficiency perhaps took a backseat to improving the top line and building new features. I don’t know if people necessarily forgot how to do it — but perhaps it was just lower priority on the list of things that every company and engineering team had to focus on. I think that changed with the current macroeconomic headwinds, which have put the FinOps function and the focus on cost back front and center.

Higher costs in a microservices environment

Michael: It also seems to me that once you go down this path, you quickly discover that these environments got pretty complex in the last few years. It’s getting pretty hard to figure out what’s going on; there’s not a lot of transparency, there’s a lot of dependencies. I’m not sure you even know where your costs are. We talked about observability. How do I get in there and correlate what I’m seeing to something that relates to what it’s costing me? 

Martin: The other trend we’ve been seeing over the last three to four years is this move to a more cloud native architecture — a lot of companies are containerizing their infrastructure and moving towards a more microservices-oriented environment. You can imagine that in that type of architecture, it’s much harder to attribute costs. You are running and using a tiny piece of compute [power] for a fraction of a second, or a couple of hours, as opposed to the older world, where teams just had a bunch of VMs and that was their cost. I think the problem of even identifying and attributing cost has become a lot tougher over the last few years, just as our architectures have evolved.

Observability does play a pretty key role here, in the sense that observability systems, like the ones that we build, are predominantly used to tell you when something is wrong in your infrastructure or application. However, they also have all of the utilization data. They actually tell you how much compute is being used for a particular workload. That utilization data is key to figuring out how many resources are actually consumed by a workload. That’s a key ingredient in figuring out your efficiency and looking for areas of improvement.
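As a rough illustration of the idea Martin describes, utilization data from an observability system can be joined with provisioned capacity to flag inefficient workloads. This is only a sketch; the workload names, numbers, and per-core price are all made up:

```python
# Hypothetical sketch: join utilization data (from an observability
# system) with provisioned capacity to estimate per-workload efficiency.
# Workload names, usage figures, and the price are illustrative only.

workloads = {
    # name: (avg CPU cores actually used, CPU cores provisioned)
    "checkout-svc": (0.4, 2.0),
    "search-svc": (3.1, 4.0),
    "batch-etl": (0.2, 8.0),
}

COST_PER_CORE_HOUR = 0.04  # assumed blended rate, USD

for name, (used, provisioned) in workloads.items():
    utilization = used / provisioned
    idle_cost_per_hour = (provisioned - used) * COST_PER_CORE_HOUR
    print(f"{name}: {utilization:.0%} utilized, "
          f"~${idle_cost_per_hour:.2f}/hour idle spend")
```

The same join works at any granularity the observability data supports — per container, per namespace, or per team — which is what makes attribution possible in a microservices environment.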

DevOps metrics for cost optimization

Michael: We collect all kinds of fascinating metrics in the world of DevOps, but I don’t think we have one that says, “This is how much it costs,” alongside the performance ones. Do we need another tab or window that says, “This performance came at this cost,” and correlate the two?

Martin: I think that is one of the main functions of a lot of these cost optimization platforms out there. Again, that may have taken a backseat in the last few years, but we are starting to see it become increasingly popular for sure. To your point, having that view side-by-side for your infrastructure is pretty critical. You can imagine asking: What is my utilization of my compute versus the capacity I provisioned? How much am I paying for it versus the value I get out of it? That is definitely an interesting concept for cloud providers in general, and for compute infrastructure. However, even when you shift over to observability, and when you look at the observability data, the same thing applies.

Increasingly, over the last few years, a lot more observability data is being produced, and it’s very expensive. You need a similar concept of: “Well, how much is this observability data costing me? What is the value and utilization I get out of it?” And you need to look for optimizations there, as well. Similar concepts now need to apply well beyond just the infrastructure that a company runs today.

Michael: Can I play what-if scenarios? One of the things you want to do is get in front of this before the workload is deployed. So, can I slide things on a scale and come up with a cost factor, and can we just be generally smarter? 

Martin: For us in the observability space, we’re definitely thinking about things like that — because you can imagine that as soon as you’ve done the changes and paid the cost, it’s almost too late. You can only fix things moving forward to prevent these things from becoming out of control. For our platform, one of the things that we are looking at is, as new observability data gets produced, we do want to give companies an ability to assess how much that is going to cost ahead of time. 

Before they actually pay for the data, because they may not even be using it, you want to give them an opportunity to say: “This is what the cost is going to be. Is this really worth the value you’re getting out of it?” and to make that yes-or-no call before they incur the cost. The pattern becomes a preventative approach, as opposed to after the fact, when it’s: here’s your overage bill, and “maybe you can do better next month,” because that pattern generally doesn’t work out well.
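The preventative approach Martin describes can be sketched as a simple pre-ingestion projection: before new observability data is accepted, estimate what it would cost per month so the team can make the yes-or-no call up front. The pricing model and numbers below are assumptions, not Chronosphere’s actual pricing:

```python
# Hypothetical sketch of a pre-ingestion cost estimate: project the
# monthly cost of new observability data before paying for it.
# The per-sample price and scrape model are illustrative assumptions.

SECONDS_PER_MONTH = 30 * 24 * 3600
PRICE_PER_MILLION_SAMPLES = 0.10  # assumed rate, USD

def monthly_cost(time_series: int, scrape_interval_s: int) -> float:
    """Project monthly ingest cost for a new batch of time series."""
    samples = time_series * (SECONDS_PER_MONTH / scrape_interval_s)
    return samples / 1_000_000 * PRICE_PER_MILLION_SAMPLES

# e.g. a new high-cardinality metric: 50,000 series scraped every 15s
estimate = monthly_cost(50_000, 15)
print(f"Projected cost: ~${estimate:,.0f}/month")
```

Surfacing a number like this at the moment the data is introduced is what turns the overage bill into an up-front decision.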

The world of AI and cloud costs

Michael: All right, there’s this little thing happening called AI out there. Can we apply that here, and maybe AI will save us from ourselves?

Martin: I think if you look at the latest innovation in the AI space, there’s a lot around large language models and a lot around generative AI. I would say that innovation in those spaces is probably less directly applicable to the problems we’ve been talking about. Perhaps it could be a better interface for how you interact with some of these insights.

I don’t know if the innovations there in the past year can be directly applied to the type of problems that we’re talking about. However, if you look at the cloud cost optimization industry, or even the observability industry, there is a lot of automation that can be done these days. 

One example from the Chronosphere side is that we automatically detect inefficient usage of particular observability data, and suggest and automate the optimization of that, just like cloud optimization platforms have done for a long time. I don’t know if a company wants to market that as AI per se, but there is definitely an automation component to things. I don’t know if it’s quite in the same ballpark as a lot of the large language model and generative AI innovation we’ve had in the industry in the past year.

The trend of “do more with less”

Michael: Who’s driving these FinOps conversations? Is it the finance people showing up, and are they really trying to drive down costs, or are they just trying to get a more consistent cost level going, because they’re tired of being surprised at the end of the month?

Martin: I would say it’s a little bit of both. If you look at the FinOps Foundation, it was actually created in 2019, so it’s been around for quite a few years. If you look at the Fortune 50, 90% of them do have a FinOps function. So, it’s existed for a while. As we talked about at the beginning, I think the focus or priority on it dropped in the last couple of years, but this year it seems to be front and center. I would say it’s heavily driven by the finance organizations, who are trying to make every company more efficient this year. “Efficiency: Do more with less” seems to be the headline for most companies this particular year. I would say that’s why there’s a renewed interest.

When we talk to companies, it’s a little bit of both of what you suggested. It’s both: “Can I reduce my current bill right now?” because that would help reduce the company’s burn and bottom line, and get a lot of companies towards profitability. But more importantly, companies really want control over this. They want more than just a lower bill today; they want predictability and they want control: “If my business grows 20% next year, I want to predictably know how my cloud infrastructure is going to grow in correlation with that. How is my observability cost going to grow along with that?”

We are talking to a lot of finance organizations that have a calculation: “For each incremental top-line dollar, I’m happy to pay X in infrastructure costs,” or “I’m happy to pay Y in observability costs.” As long as they have that visibility and control, over the long run that’s even more important than saving dollars today. Of course, everybody also wants to save some dollars today.
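The finance-side calculation Martin mentions can be sketched as a simple guardrail: track infrastructure and observability spend as a ratio of revenue, and flag when a ratio drifts past the agreed budget. Every figure and threshold below is illustrative:

```python
# Hypothetical sketch of a cost-per-revenue-dollar guardrail.
# Revenue, spend, and budget ratios are made-up illustrative values.

REVENUE = 2_000_000          # monthly top line, USD
INFRA_BUDGET_RATIO = 0.08    # willing to pay 8 cents per revenue dollar
OBS_BUDGET_RATIO = 0.01      # 1 cent per revenue dollar for observability

def within_budget(spend: float, ratio_budget: float, label: str) -> bool:
    """Report spend as a fraction of revenue against an agreed budget."""
    ratio = spend / REVENUE
    ok = ratio <= ratio_budget
    status = "within budget" if ok else "OVER budget"
    print(f"{label}: {ratio:.1%} of revenue ({status})")
    return ok

within_budget(150_000, INFRA_BUDGET_RATIO, "infrastructure")  # 7.5%
within_budget(24_000, OBS_BUDGET_RATIO, "observability")      # 1.2%
```

Framed this way, the question shifts from “is the bill big?” to “is spend growing in proportion to the business?”, which is the predictability finance teams are asking for.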

Implementing cloud native observability

Michael: What’s your best advice to folks who are trying to implement observability? Because a lot of the folks that I talk to love the idea, but are a little overwhelmed by everything, and they’re not quite sure how to get it to actually function the way that they had hoped. Even once they get it, they’re not sure what questions to ask. So, how do I get this into some realm that it’s just more accessible to everyone?

Martin: That is a fantastic question, and I’ll even explain why that is happening. As we talked about earlier, the architectures are getting a lot more complex. So, the tool sets that worked for us years before don’t quite work anymore. They weren’t really optimized for cloud native workloads – containers and microservices. I’ll say that the first thing is looking at your tool set and applying the right tool set for the right environment and architecture. If you are moving towards more cloud native workloads, then for those workloads in particular, picking a tool that is optimized for that type of workload would be one thing that makes it more effective.

One of the other blockers we’ve been seeing, and this is related to cost, is that as a company does this shift over to containers and microservices, the traditional tools get a lot more expensive, because a lot more data gets produced in these new environments. Cost is often a blocker. These tools may be working great, but cost-wise: “I just can’t afford to have coverage of all of my workloads, or all of my hosts.” That is a really bad place for companies to be in.

Frameworks for better visibility over data growth

Martin: I’ll say from that perspective, when Chronosphere joined the FinOps Foundation, we released a vendor-neutral framework called the Observability Data Optimization Cycle. It’s a bit of a mouthful, but essentially it’s a framework for getting visibility into the cost of observability, along with particular techniques for controlling that growth in data. Irrespective of what tool you have, through this framework there are ways and techniques to get control over that growth of observability data. So, that could be one thing that’s useful for companies out there to solve the cost problem. And then, picking the right tool for the right environment would be my other piece of advice.

Michael: Are we still culturally too wrapped up in chasing predefined metrics and monitoring tools, and not quite going to the next level? I feel like a lot of people are still struggling with: “Well, I do monitoring, why do I need observability? Because doesn’t monitoring mean observability?”

Martin: I think there are a lot of thoughts in this particular area. My personal view on this is, to your point, the buzzword is now observability. The end result of what you’re trying to achieve is the same, right? We are all trying to reduce mean time to remediate (MTTR) like we have for the last 20 years, or mean time to detect (MTTD), right? We’re trying to reduce the time we can detect and resolve issues. That’s the name of the game, and that hasn’t changed. 

The thing that has changed is these architectures. The infrastructure you’re running on is so much more complex now, than it was before, and perhaps you need a new tool set to go and approach that. Now, that new tool set could be, to your point, not predefined. Maybe you don’t want anything predefined, and you want everybody to just go and access the raw data and go debug your problems that way. 

That could be one approach to the problem. What we have found working with a lot of companies out there is, that approach is only effective for certain individuals in an organization — the power users, the folks that know how infrastructure runs. However, what we find is that most of the operators of services in production these days are the average developer. 

The average developer doesn’t know how the rest of the infrastructure works, because it’s fairly complex. We do find a need to still have some easier concepts that anybody who is not an expert in observability can pick up and use to go and debug their systems and get the job done — not just cater to the power users. My belief is that both exist in the world, and tooling probably needs to serve both audiences.