The log data growth problem
When log data continues to grow at a massive rate, modern teams can feel like they’re losing control over their complexity and scale. In a recent survey of 120 engineers, we found an average of 2.5x log data growth over the past 12 months – with some teams seeing as much as 10x growth.
As frustrating as managing log data can be, ditching your logs isn’t the answer. In a recent interactive roundtable, observability experts came together to discuss log data management trends and preview our new Telemetry Pipeline solution.
Read through the full transcript below, or watch the on-demand video.
Let’s talk log management strategies…or lack thereof!
To kick off the discussion, Chronosphere Head of Product & Solutions Marketing, Rachel Dines tees up what the panelists will cover when it comes to current log management challenges, and how teams can create a successful strategy.
Rachel Dines: We’re going to start with just a quick backdrop on what we’re seeing in the market landscape. I think you’re all here because you know you have challenges with log management. And so, we’re going to talk a bit about what we’re seeing in the market and actually share some recent data that we collected that I think will be interesting for the discussion.
But then, we’re going to jump into the roundtable discussion. We’re going to have some polls, we’re going to have some debates. This is where things are going to definitely get interesting. So, be ready to vote with your mouse clickers. Then, we will give a quick overview of some of our product offerings and then into the lightning demos.
So, with that, onto the landscape and the challenges. I don’t think I need to tell all of you that log data is growing at a massive rate. Why is log data growing so quickly? This is something I see across all of our customers and many, many people in this space. Well, one, a big part of it is being driven by the move to microservices and containers, right? So, much more data is being generated there, but there’s also a lot of different constituents and stakeholders, right? From the development team, to the security team, to operational teams, they’re all collecting the same log data and then using it for different use cases, when they could be doing a more consolidated collection.
Not only that, because of the many different teams and use cases, there are a lot more tools, a lot more places to route to, and a lot of different collection formats and types. And so, this is getting incredibly complex. And of course, there’s not a lot of standardization.
There’s not a lot of strategy either. And most of this data is likely not even that useful to you. As for what we’ve seen in the market: we recently did a survey of about 120 engineers and engineering leaders in the enterprise space, and asked them about log management and what was going on with them and their logs.
A few things that came to the surface that were really interesting:
- There’s been 2.5x log data growth over the past 12 months on average. Some people were seeing like 10x growth. Some people were seeing less than that, but that means for most of you, if you used to emit a hundred gigs a day a year ago, today you’re emitting 250 gigs a day. That’s a significant amount of growth. A lot of that data, like I mentioned before, is not necessarily all that useful.
- We found that 40% of respondents were moving more than 100 gigs a day, and 20% were generating more than a terabyte a day. I know I have personally seen people generating dozens or even hundreds of terabytes a day. I bet some of you out there might even be in the petabyte range.
- And then, the most interesting thing was the log collection format. So, we asked people: “What technologies are you using for collecting logs?” and they could pick as many as they wanted. 71% of people said they’re using more than one collection method. And this is just a word cloud showing how frequently people cited each collection method.
So, you can see [this] all over the place – from open source options like Fluent Bit, Logstash, OpenTelemetry, to closed source ones like CrowdStrike, even some Splunk options in there as well. So, basically not a lot of standardization and not a lot of consistency, in addition to rapidly growing complexity.
The importance of a log management strategy
So, where does this leave us? Well, this leaves us with more data equals more costs. You don’t need to be a genius to figure that out. On top of that, it’s more time consuming to manage and route all of these logs. For example, I was working with one of our large investment bank customers that used to spend months, literally months, just to set up a single log integration from source to destination.
It’s very labor intensive. And on top of all of this, we see that few people have a strategy when it comes to their logs, so they don’t really think too much about things like: “What are we using for what?” and “What’s the purpose of this?” I want to throw it over to Bill real quick to give us a real world example about this: how has the lack of log standards and strategy impacted you in your past lives?
Bill Hineline: I mean, it’s a great point to bring up. I think that you capture it right here, but when you think about the volume of logs that we all deal with and the varying sources of all those logs, the formats alone can keep an army of RegEx experts busy to get things moving in the right place, right? And all of that’s necessary to get the right value, to get them in the right place.
But you can’t think about that when you’re building the RegEx. You’ve got to have a strategy that says: “We’re going to follow these formats. This is going to help us drive better insights. And oh, by the way, we need to make sure we’re logging A, B and C, right?” So, it’s something to think about upfront, not when you decide you need that army of RegEx experts.
Rachel Dines: Yeah, so true. I think it’s hard too to set a centralized and consistent strategy across a large distributed organization when every team and every developer and every group wants to do something a little bit different. Not super easy. So, even just coming in with some intentionality about this is really, I think one of the first key steps.
So, anyhow, long story short. We’re all here because we recognize that something needs to change. We’ve come to a breaking point when it comes to logs. We can’t just keep collecting more and more data with more and more different kinds of formats and expect outcomes to get better, because they’re not. So, let’s jump into the roundtable discussion where things are going to get more interesting. This is where I need all of you to get ready to vote. So we’re going to field a couple of polls and have some discussion around them. These are anonymous polls, and so I’d really love for everyone to just click in and participate so we can have a discussion around what the results are.
Matching volume of log data to value
Eddie Dinel: I think the problem that we fundamentally see is that the volume of logs grows at a different rate than the value of logs. The value grows arithmetically; it increases a little bit every year. But the volume of logs is growing geometrically, because you have so many microservices, because you have so many different components, all generating logs. And that’s growing at a totally different rate.
One of the jokes that I like to make about this is that even if you think only 10% of logs are useful, figuring out which 10% in advance, when you’re not actually troubleshooting, is the impossible dream. 10% of them are useful in there, but figuring out which 10% before you’re actually digging through them in the 3am triage meeting, that’s the impossible bit.
Rachel Dines: It’s so true. I mean, if I could only wave the magic wand. Even for the one person who thought 40% or more are useful, that’s amazing. But like, do you know which 40%? It’s a great question. And it leads me into one of the questions that we got from the audience. I thought this would be a good moment to pause and discuss – which is: “How have you seen challenges in log data management evolve over the past few years?” Eddie talked about microservices and containers and that kind of driving some of the challenges here. Does anyone want to chime in on that question?
Eddie Dinel: Well, I actually want to expand on that just for one more second. I think that’s part of why you’ve started to see the emergence of telemetry pipelines: the sheer volume of logs being generated means: “Hey, we’ve got to be able to do something with this.” We have to be able to manage this more effectively. So that’s one of the big evolutions that I’ve seen in the industry around this space.
Bill Hineline: As the Chief Pessimist Officer, I’ll chime in here and say that when you think about logs, 10 years ago they were written by people who were going to look at them and not everybody looked at them or even knew where they were.
Now the content of logs is looked at by a lot of people and if you’re not the expert in the app or the infrastructure component, then those logs don’t provide a lot of value to you. And when we think about observability, I think companies will do best if they think about observability for all.
And it’s about providing the color and the context. As my good friend John Potocny says, logs are the glue of observability, right? And so, if that glue isn’t strong, then you’re not going to be able to paint that nice picture, and you’re going to spend a longer time on that 3am call. So, that’s the reason for my pessimism. I don’t want to be pessimistic and paint a picture that all logs are garbage, because they’re not. It’s just that the way we’ve looked at logs over the years is changing. We’ve gone from: “I’ve got this log over here in this directory that people don’t really know I’m writing to,” to: “It’s going into these telemetry pipelines where everybody can look at it and context is important.”
Alok Bhide: I think there are a couple of things I’ve seen. I’ve actually been in this logging space for a very, very long time, as a vendor most of the time. Eddie touched on one main point, which is that you don’t really know your log data. So telemetry pipelines have really started taking center stage … Logstash is a pipeline. Telemetry pipelines, though, are much easier to use, easier to play with, easier to deploy, etc. So they achieve your use case of controlling the data much better than using something off the shelf. So that’s one major thing. But also keep in mind that logging, logging products, and logging itself have become commoditized. Way back when, there was Splunk. It was the best product out there. But today it’s commoditized. So, in some ways the challenge has been dealt with by more vendors in the industry creating better products, pushing down prices, and offering more feature sets that allow you to control that data. Some do, some don’t. So, that’s one way the vendor side of the industry has changed, and it has helped with the cost part of log management over the past few years.
Rachel Dines: Yeah, no, those are all great points. And there was another great point in the chat that I wanted to call out: the ability to correlate log data with metrics from other monitoring tools is key to gaining a holistic view of system performance. We’re looking at trees, but there’s a forest here, and logs really don’t stand alone. They need to stand with metrics, events, and traces. It’s a fantastic point. So, thank you for chiming in on the conversation. I’m going to move along to the next poll question, which is about predicting the future. So, everyone get out your crystal ball, and once again, just use your best guess.
The future state of log data
Rachel Dines: What do you think the future state is of logs as it relates to observability and troubleshooting? So, as we mentioned earlier, we use logs for a lot of things. We use logs for security, for observability, for operations. But for observability in particular, do you think it will increase in importance in the next one to two years, maintain the same level of importance, or decrease in importance? Do you think it’s going to decrease so much that you think you’re going to phase out logs in the next one to two years, or do you think you’re going to phase out logs at some point, but you don’t know exactly when? What’s your take on this?
Alok Bhide: I mean, it’s a tough one. It also depends on which customer you speak with. In the monitoring space that I see, there are also different industries that care differently about logs and what their expectations are. If you talk to a security operator, there is a much higher dependence on logs for everything. Significantly higher.
If you look at the monitoring space, I think the one interesting trend I’m seeing, and I don’t think this is a 100% trend, is that certain customers have told me we are going to move away from logs completely. I think that’s an extreme statement. I don’t think they mean it quite as much as they say it. I don’t think that’s easy to do, but I have heard of it. And so there’s this: “Let’s move toward traces.” But I think there’s just like this desire to move away, not always a strategy to move away.
I don’t think there’s a groundswell of desire to move away. Administration teams are the stakeholders who worry about the cost of logs; that’s where it’s coming from. That’s one thing that I see. I do see that, therefore, the growth of that dependence on logs will go down. It won’t go away, it will go down. Bill, you might have a bit of a different point of view on this. Or maybe not. Maybe it’s the same.
Bill Hineline: I agree. I think the answer is that it depends, but I do think logs are here to stay. I think that the value of logs is dependent upon the organization’s ability to control their content and make sure they’re enriched and that they can be correlated well. I think that if I think about what my workflow has been in observability, it’s always been to look at traces and metrics first, but logs always provide that great additional context, especially if I know nothing about an app, which is often the case for a central observability team who are just trying to jump in and help by being experts in a tool, right?
But if I can look at an observability tool, look at a trace, see that there’s something going on, and then look at a few log entries that are well correlated because they carry rich data, then I can help find things much faster. And frankly, you make support really scalable for a large enterprise, because people don’t all have to know everything about every app all the time. They just need to be experts in a tool.
How do you route logs?
Rachel Dines: All right, on to our third and final poll. The question is: How do you route logs? Do you use open source routing methods like the OTel Collector, Fluent Bit, Logstash, or Kafka? Do you use a built-in collector from a commercial off-the-shelf vendor, like Splunk forwarders? Do you use a totally third party telemetry pipeline like Cribl? Or is there just not really a consistent or unified strategy, and everyone does it a little bit differently? While you are exercising your fingers, I want to talk a little bit about the telemetry pipeline space, because it feels brand new, like it just came out of nowhere; we weren’t talking about telemetry pipelines four or five years ago.
Now, I feel like a lot of people are talking about them. Eddie, I wanted you to give us a quick history lesson on how this space has evolved, how it came to be, and where it came from.
Eddie Dinel: It really did kind of come out of nowhere, which is sort of how everybody feels about it, like: “Oh my gosh, there’s this thing called a telemetry pipeline all of the sudden.” But it emerged from the fact that we are generating more logs from more places in more formats than ever before. I have talked to customers who are storing their logs in databases. I’ve talked to customers who are storing their logs in raw files. I’ve talked to customers streaming their logs, and pulling things from Kafka and so forth.
So, there’s a whole bunch of different sources of logs because application topologies and application architectures have changed. And so, instead of having one big process that does everything, or even a small number of processes that do everything, now you’ve got dozens or hundreds of processes, and each of them generates a sort of baseline amount of logs no matter what you do. You’ve got microservices that are going to generate a ton more logs than monolithic applications do. So as these application architectures evolve towards that kind of modern world, your logs are going to explode in terms of volume and in terms of complexity.
Rachel Dines: Yeah, and this definitely underlines why telemetry pipelines exist – to help with both of those problems. What’s interesting is that Gartner said that by 2026, they expect 40% of log telemetry will be processed through a telemetry pipeline product, which is a big increase; the last time they made this prediction, in 2022, the figure was 10%. So, in 2022, 10% of log data was processed through a pipeline, and they expect that to be up to 40% by 2026. So, let’s see how that compares to how you all voted in the poll.
It looks like not everyone’s voted yet. Don’t be afraid to say you don’t have a consistent strategy. I expected that to be the most common answer here. I’ll give you one last chance to answer the poll before we push it live. Three, two, one, alright! Let’s see the answers. Okay, so it looks like a little less than half of you are using the built-in collector from your commercial off-the-shelf vendor.
That’s what I expected, and I’d be curious to throw this over to Eddie to get his point. 24% are using a third party telemetry pipeline. That’s a decent number. And then an equal 24% are also using open source routing methods.
And you saw from the word cloud earlier in the presentation that the open source solutions were the most common, right? The Fluent Bit, Logstash, OTel were the most common ways of collecting data. So Eddie, what do you think about these results?
Eddie Dinel: I think these are really interesting. I agree with you that it doesn’t entirely surprise me that folks are mostly using a log back-end vendor, like: “I’m going to just use their tooling and adapt with that.” But I think there’s opportunities here, particularly with the third party telemetry pipeline and the open source methodologies, to operate things in a little bit more efficient and flexible ways over time.
Rachel Dines: Yeah, and I see a comment too: “We’re migrating to OTel. Not there yet, but beginning the journey.” If I had a nickel for every time I heard that … I think a lot of people are in that spot. OTel is obviously a really powerful solution, you know, strongest in traces, decent in metrics, and with a bit of a way to go for logs in our experience.
Enrich and transform your logs
Rachel Dines: So, let’s dive a little bit deeper into each one of these, starting with the Telemetry Pipeline, Eddie.
Eddie Dinel: Awesome, thanks. So fundamentally, there are a few things that we attack in this problem space. The first is what we sometimes like to call the “Tower of Babel problem.” You have a whole bunch of different formats and a whole bunch of different sources coming from one side. And, you have a whole bunch of different destinations that expect a whole bunch of different formats potentially on the other [side].
And you need some way to translate between them. You need some way to move from one to the other, and you need to move across those two things. But the second thing that I think is perhaps more important as you’re moving things from a number of different data sources to a number of different destinations, is that you can transform and enrich those logs along the way.
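To make that first, Tower-of-Babel step concrete, here is a minimal Python sketch that normalizes two hypothetical source formats (JSON and key=value text) into one common shape before routing. The field names and formats are made up for illustration, and a real telemetry pipeline would express this declaratively as configuration rather than hand-written code:

```python
import json

def normalize(raw: str) -> dict:
    """Translate a raw log line into a common {timestamp, level, message} shape."""
    if raw.lstrip().startswith("{"):
        # Hypothetical JSON-emitting source
        record = json.loads(raw)
        return {
            "timestamp": record.get("ts"),
            "level": record.get("severity", "INFO"),
            "message": record.get("msg", ""),
        }
    # Hypothetical key=value source, e.g. 'ts=... level=... msg=...'
    pairs = dict(part.split("=", 1) for part in raw.split() if "=" in part)
    return {
        "timestamp": pairs.get("ts"),
        "level": pairs.get("level", "INFO"),
        "message": pairs.get("msg", ""),
    }

print(normalize('{"ts": "2024-05-01T03:00:00Z", "severity": "ERROR", "msg": "timeout"}'))
print(normalize("ts=2024-05-01T03:00:01Z level=WARN msg=retrying"))
```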
So, as was mentioned earlier, you may have context at the time the log is being written that would be incredibly difficult to reconstruct later. Like: “You know, right now, this particular metric is above 90%. Boy, it would be interesting to decorate the log with that additional information on its way to the destination,” so that if I needed to go back and look at it an hour later, or two hours later, or a day later, I’d have that additional information associated with it. I might also want to reduce the amount of logs. Like, maybe I don’t actually need to store every time I write an HTTP 200 in my log. Maybe I just need to generate a summary that says: “Here’s all of those things.” And I’ll show a little bit more about that in a demo in a little bit.
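As a rough illustration of that enrich-and-reduce idea, here is a short Python sketch. The metric lookup, field names, and summary record are hypothetical; in the actual product this would be configured as processing rules rather than written as application code:

```python
def current_cpu_utilization() -> float:
    # Hypothetical stand-in for a metric that is known at processing time
    # but would be hard to reconstruct hours later.
    return 0.92

def enrich(record: dict) -> dict:
    """Decorate a log record with context available while it is in flight."""
    record["cpu_utilization_at_write"] = current_cpu_utilization()
    return record

def summarize_http_200s(records: list[dict]) -> list[dict]:
    """Keep and enrich interesting records; collapse routine HTTP 200s into one summary."""
    kept, ok_count = [], 0
    for record in records:
        if record.get("status") == 200:
            ok_count += 1
        else:
            kept.append(enrich(record))
    if ok_count:
        kept.append({"summary": "http_200", "count": ok_count})
    return kept

logs = [{"status": 200}, {"status": 200}, {"status": 500, "path": "/checkout"}]
print(summarize_http_200s(logs))
# [{'status': 500, 'path': '/checkout', 'cpu_utilization_at_write': 0.92},
#  {'summary': 'http_200', 'count': 2}]
```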
So, I want to be able to enrich the logs, I want to be able to potentially transform them, and so forth. And the last thing that I think is useful to know is that we’re building all of this on top of Fluent Bit. Because we’re building on top of Fluent Bit, we’re building on top of a solution that has been downloaded and used over 13 billion times — we like to joke that it’s a little bit like the old McDonald’s hamburger signs, like billions and billions and billions served. This means that it is running on top of some of the most rock solid, battle tested software that’s out there. It’s incredibly performant, it’s incredibly efficient, and it’s a Kubernetes-native solution.
So, anything that you can do with Kubernetes, like your horizontal pod autoscaling, your automatic load balancing, or self-healing if one of the nodes goes down for some reason, all of that comes for free as part of Kubernetes by building and designing in this way. So, that’s a high level overview of what we do with the telemetry pipeline.
Telemetry pipeline benefits
Rachel Dines: So, just to summarize, when customers work with us, whether they’re using the telemetry pipeline product or the logs product as part of the observability platform, there are two big benefits they see:
1. Typically, we see customers cut their costs by at least 30%. That can definitely be higher; it depends on how aggressive you want to get. One of the customers I was working with recently, a large commercial bank, reduced their Splunk bill by millions of dollars, and once again, that was something they were able to do in a short period of time. I think they could probably even go significantly deeper.
2. On the streamlining operations side, this is where you can really save a ton of time collecting, routing, and transforming logs. If you have logs in tons of different formats, which you likely do, that’s okay, right? You can help reduce the operational burden. And similarly, that same bank that reduced their Splunk bill by millions of dollars is managing 20,000+ Fluent Bit and Fluentd agents in the field, so the fleet management side can also help reduce the operations. And the customer I mentioned at the beginning of the conversation said it took them months to set up their pipelines. Well, now you can do it in minutes, and I’m going to prove it.
I’d love to pass it back over to Eddie to do a pipeline demo. So, lightning demos: that means they get about three to five minutes each to show us as much of the product as they can and keep it exciting and interesting. Eddie, you’re up first!
Apply processing rules and make logs your friend
Eddie Dinel: Now, let’s see what we should do next. Let’s take out that log line, the original raw log line, because we don’t need that anymore now that we have the structure. And you can see that there are a whole bunch of other additional things that we could do here. We could block keys, we could block whole records.
We could do multi-line joins to bring things together, and so forth. In this case, I’m just going to flatten the sub record down so that we remove some unnecessary nesting. You can see that what we end up with here is that we’ve gone from a bunch of raw unstructured data to much more structured data that we can understand more easily when we’re going back and looking at it.
And we’ve reduced the total volume of the logs by quite a bit. But we can also do some more fun stuff, like profile this processing as well, and see how long it takes to run each of these things … So you can see, for example, that the block rule takes 1.4 milliseconds to run. You can see that the parse rule took one millisecond to run, and so on and so forth. And this lets me understand which rules are more performant, and so forth, as I go through things.
So, when I think things are good, I can go ahead and apply the processing rules. Now, that processing rule has been deployed and I’ve got it set to go. So, that’s my lightning demo.
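For readers following along, here is a rough Python sketch of the two transformations from the demo: dropping the original raw log line once it has been parsed into structure, and flattening a nested sub record to remove unnecessary nesting. The key names are hypothetical, and in the product these steps are applied as processing rules rather than code:

```python
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts, e.g. {'http': {'status': 200}} -> {'http.status': 200}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

parsed = {
    "log": '10.0.0.1 - - [01/May/2024] "GET / HTTP/1.1" 200',  # original raw line
    "http": {"method": "GET", "status": 200},
    "client": {"ip": "10.0.0.1"},
}
parsed.pop("log", None)  # the raw line is redundant now that the data is structured
print(flatten(parsed))
# {'http.method': 'GET', 'http.status': 200, 'client.ip': '10.0.0.1'}
```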
Watch the full conversation in the on-demand video.