How logs can help to improve security and observability


This webinar recaps how Chronosphere’s Telemetry Pipeline streamlines your log data management and optimizes your data footprint.

Sophie Kohler | Content Writer | Chronosphere

Sophie Kohler is a Content Writer at Chronosphere, where she writes blogs and creates videos and other educational content for a business-to-business audience. In her free time, you can find her at hot yoga, working on creative writing, or playing a game of pool.

21 MINS READ

Log data is skyrocketing

Log data has skyrocketed by 250% in the past year alone. With sources multiplying, formats diversifying, and the number of destinations increasing, managing this data is becoming more costly and complex for teams.

In a recent webinar, Chronosphere Director of Product & Solutions Marketing, Riley Peronto, and Chronosphere Field Architect Anurag Gupta dive into how Chronosphere’s Telemetry Pipeline solution can streamline your log data management, optimize your data footprint, and alleviate the burden of telemetry collection and routing.

If you don’t have time to listen to the full chat, catch a transcript below.

Observability and security teams: What we’re seeing today

Riley Peronto: I’ll start off by talking a little bit about the Chronosphere Telemetry Pipeline and the challenges that we’re solving as an org. Then I’ll hand it off to Anurag to talk about why you might want pre-processing of your log data within your organization. Then, we’ll shift over to the actual demo. We’ve got use cases that are both basic and advanced, as well as some strategies to help you reduce log data inflate. With that, let’s talk a little bit about the challenges we solve as an organization. 

When we’re talking to security and observability teams, we see two things across the board. First, I see that infrastructure is increasing in complexity. So, what do I mean by that? Teams are sending data to more backends than they ever have before. That might be a log management or an observability platform. It could be a SIEM platform. 

You could even be sending data to a data warehouse or a data lake or something of that nature. And while there are many of these tools, they're adding a ton of value to security and observability practices, and they're helping teams really get more value out of their logs. However, as you add more backends in parallel, you also need to add more ways to collect that log data and ship it downstream.

Management burden and battling complexity

This is where the complexity can start to creep in. What we see is that teams are adding more agents upstream than they’ve had before. And this makes it harder to enforce changes upstream. It makes it harder to configure everything in a consistent manner. It can make it hard to make sure that you’re getting all the logs you need without any blind spots.

Overall, this creates a big management burden for the engineering team, or whoever is responsible for your logging infrastructure, to keep up with over time. The second challenge we see is the sheer growth of log data overall. Last quarter, we surveyed over 270 folks who are responsible for their observability budgets, and we asked them how much their data had grown over the past year. In a one-year time frame, they reported that data had grown on average 250%. That's astronomical growth, and we're seeing it basically across the board. It's probably pretty obvious why this is bad: when you think about a log management or a SIEM tool, you're often being charged based on the volume of data you capture in that tool.

As this data grows and grows, it becomes impossible to put all that data in your log management platform without breaking your budget. So the question becomes: “How do I get my team the data they need to support their use cases without completely breaking my budget?” And that’s a really hard question to answer.

That’s what we’re seeing across the board. And let’s talk a little bit about how we solve this as an organization. So Chronosphere’s Telemetry Pipeline is a vendor agnostic platform that collects your data from any source. It preprocesses that data in flight and it streams it to any destination. In essence, you can think of it as basically a control plane that gives you control over your data from the point of its creation all the way to its final destination.

Introducing Chronosphere’s Telemetry Pipeline

Riley Peronto: A couple of things I want to double-click on here. First, I used the term "vendor-agnostic." Chronosphere's Telemetry Pipeline is a standalone product offering. You do not need to send your logs to the Chronosphere platform. We absolutely support that, but we also work with New Relic, Datadog, Elastic: whatever tools you have in your stack, we can stream logs to those outputs.

The second thing I want to spell out is that this platform is built on Fluent Bit. Our technical co-founder and Field Architect, Anurag Gupta, who is joining me in today's webinar, is actually the creator of Fluent Bit. So the platform is built on open source and open standards. There are no new proprietary agents you need to deploy, and there's no vendor lock-in of any sort.

It’s all open source and open standards. On the right hand side of the screen here, we list a few ways in which you can pre-process data with Chronosphere. That’s really the focal point of the webinar today. And it’s just the tip of the iceberg what I have on screen here. So I won’t get too deep into this. I do want to call out this reduced row though, so a common driver for adopting telemetry pipeline technology is to right size your data footprint to reduce logging costs. When customers approach us with this use case in mind, big and small customers have been able to save at least 30%, often 50%, or 60% on their logging costs, both through the reduction of cloud egress fees, as well as the reduction of indexing and retention in your log management platform.

So, we have a lot of really powerful outcomes when you’re using Chronosphere Telemetry Pipeline.

Now, before we dive into the actual demo, where we’re going to show you the different ways in which you can process data, I wanted to help you guys visualize how this technology might fit into your stack. This is actually an amalgam of a couple different customers that we work with. And what we did is, we looked for common infrastructure patterns just to give you a visualization of what this might look like before and after.

Streamlining observability migration

What we’re looking at here, the customer is, streaming log data to a few different backends here. They’re leveraging proprietary vendor specific log collection agents. They’re pushing their Windows and Microsoft Exchange logs through Logstash. They have a Kafka pipeline they need to maintain.They also have syslog outputs from their network devices and their firewalls. So if you think about this, take a step back. That’s five to six different resources they need to manage. And their team needs to make sure that everything’s configured correctly and is pushing logs to the right destination.

It’s just a big burden for teams to keep up with all of this work and all of these different resources. When you look at the after, effectively what we’ve done is we’ve consolidated collection and routing. All within the chronosphere telemetry pipeline. You can manage this all from one place and we’ve made everyone’s lives a lot easier in doing so – at the same time we’re pushing compute processing upstream so that you can manipulate or shape or transform your data in different ways, whether to better support your use cases and drive specific outcomes or to support the different backends you support. The one thing I want to call out here, if you might notice that we swapped out one of the backends for Dynatrace, what Chronosphere can effectively do is it allows you to collect data from one resource or one agent and push that to a different backend.

This can streamline the migration from one backend to another. Another thing I want to call out is the efficiency with which we execute these use cases. We recently benchmarked our pipeline against a competing telemetry pipeline in the market. In doing so, we performed a sizing calculation and found that Chronosphere uses about 20x less infrastructure resources than this other pipeline.

What this means for customers is that you're spending less money, you have a lower TCO on your telemetry pipeline technology, you're spending less time managing infrastructure resources, and you're getting better performance as you push large volumes of data downstream. Overall, we're able to do this in a very high-performance, efficient manner.

With that, Anurag, I’ll hand it off to you and we can talk a bit more about specific processing use cases and we can dive into the actual demo.

Processing data with Chronosphere’s Telemetry Pipeline

Anurag Gupta: Great to chat with everyone today. I'm pretty excited to show the demos of what we can do with processing both security and observability data. And it really comes down to use cases. When we at Chronosphere think about processing, we think of it in two ways. One is: "How do we go address these use cases that you see on the screen?" But the second is also: "How do you make it so that folks in the organization have rapid access to test, try, and build processing?" That processing can cover basic things, advanced things, or log reduction use cases. So, I'll talk a little bit about that, how we think of it, and why that's different from some of the other players in the market. The first bit is: what is basic processing?

Processing is generally an overloaded term, right? If you think of a computer processor, it's doing everything and anything under the sun. The way that we like to think about it is this: if I have a stream of data coming in, that data might need to be changed a little bit, have some additional fields added, some context to make it faster for debugging and easier to understand. We might want to do a little bit of parsing, which means taking that log, or that data stream, and deriving different keys and fields from it. We might do some redaction, and we might do some additional transformations on top of that, really changing the data as it comes through.
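
To make that concrete, here is a minimal sketch in Python of the kind of one-message-in, one-message-out step described above: parse a raw line into fields, redact a sensitive value, and add a little context. This is illustrative only, not Chronosphere's actual rule engine, and the log format, field names, and service tag are assumptions.

```python
import re

# Illustrative stand-in for a single "basic processing" step,
# not Chronosphere's actual processing-rule implementation.
LINE_RE = re.compile(r"(?P<ts>\S+) (?P<level>\w+) user=(?P<user>\S+) msg=(?P<msg>.+)")

def process_record(raw_line: str) -> dict:
    """One message in, one message out: parse, redact, enrich."""
    match = LINE_RE.match(raw_line)
    record = match.groupdict() if match else {"msg": raw_line}
    # Redaction: mask the user identifier before it leaves the pipeline.
    if "user" in record:
        record["user"] = "***"
    # Enrichment: add context that makes debugging faster downstream.
    record["service"] = "checkout"  # hypothetical service tag
    return record

print(process_record("2024-05-01T12:00:00Z ERROR user=alice msg=payment failed"))
# {'ts': '2024-05-01T12:00:00Z', 'level': 'ERROR', 'user': '***',
#  'msg': 'payment failed', 'service': 'checkout'}
```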

That's really your very basic, one-message-in, one-message-out type of use case. Now, advanced processing is where things start to get fun. If you're in the security market, you've heard about OCSF, the Open Cybersecurity Schema Framework for security logs, or you've heard of other schemas like SAF, CEF, or whatever else your SIEM requires. Those schemas can be pretty heavy to go and enact, and typically what you'll find is that you have to use a proprietary method, a bespoke integration, to get data into that schema so that your detections and your alerts are as efficient as possible. The nice thing is that, on top of the basic processing, we have out-of-the-box templates that give you some of that.

You can easily test and try things like Windows events and see what those transformations look like to get the data into those schemas. The second is enrichment from other data sources. With basic processing, it's just one message in, one message out. Advanced processing can say: one message comes in, we look it up against an API or go retrieve a CSV file with some indicator-of-compromise data, and then use that as part of the data feed. And then, last but not least, log reduction. Log reduction can be made up of both these basic and advanced use cases. How do we do some small transformations that can be really impactful? I'll show a couple of examples of that. And then, with some of the advanced processing where you're essentially doing stream processing, like aggregation, dedupe, and sampling, how can we summarize the data to really make an impact on some of the costs that we've started to incur? Those are the use cases for processing. I'm going to dive straight into the demo here.
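
As a quick aside before the demo, here is a hedged sketch of that enrichment-lookup idea: a record is checked against an indicator-of-compromise table loaded from a CSV. The feed contents, column names, and src_ip field are hypothetical, and a real pipeline would likely fetch the file from an API or object storage rather than inline text.

```python
import csv
import io

# Hypothetical indicator-of-compromise feed; in practice this might be
# fetched from an API or object storage rather than defined inline.
IOC_CSV = """ip,threat
203.0.113.9,known_scanner
198.51.100.7,botnet
"""

def load_ioc_table(csv_text: str) -> dict:
    return {row["ip"]: row["threat"] for row in csv.DictReader(io.StringIO(csv_text))}

def enrich(record: dict, ioc: dict) -> dict:
    """Advanced processing: one message in, plus a lookup against external data."""
    threat = ioc.get(record.get("src_ip", ""))
    if threat:
        record["ioc_match"] = threat
    return record

ioc_table = load_ioc_table(IOC_CSV)
print(enrich({"src_ip": "203.0.113.9", "msg": "login attempt"}, ioc_table))
# {'src_ip': '203.0.113.9', 'msg': 'login attempt', 'ioc_match': 'known_scanner'}
```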

 


Anurag Gupta: Now, in order to do processing, each of our pipelines acts as an independent object. They have their own scale, their own health, their own storage and retry: all of those very familiar aspects when you're thinking about "How do I do a lot of this data movement at scale?" To give a sense of what sizing looks like for one of our pipelines that is receiving data: one vCPU can handle about two terabytes of data in and two terabytes of data out, with some processing. So it's very effective and very efficient, especially if you're doing a lot of scaling.

Now this is the pipeline screen, where I can configure my sources and my destinations. And because we allow you to have multiple pipelines within a single core instance, you can really keep these very simple. I can have one source and maybe two or three destinations. You're not limited to having every source and every destination within a single pipeline; in fact, that's counterintuitive to what you may want. If you don't want one pipeline to affect another, you might want to scale them differently. Data going to your SIEM is top priority, and you can't have any of that missing. But data going to maybe your data lake? Eh, if it doesn't make it, okay, that's fine.

We have a little bit less of an SLA on some of that. Now in this pipeline, what you'll see is that there are these two little bubbles for processing rules: every source has a processing rule, and every destination has the ability to add a processing rule. And so, if I click into this bubble, I'm going to launch what is essentially our processing rule IDE, where we build the logic, do the transformations, and really get to play around with what we want to accomplish. Now, for folks who might have used Fluent Bit or Fluentd in the past, many times you're left guessing what your output is, right? You're going to change your config, you're going to do the transformation, and then you might pass a log message through and say, okay, what does it look like on the other end?

And I know, at least from my experience doing some consulting with large financial institutions, we would sit with two laptops open, and I would fire a message over and the person on the other side would say: "Okay, this is what it looks like." That's effectively what we've built here with this UI. Here's the laptop on the left with your inputs, and here's the laptop on the right with your outputs. And we can very quickly, in under 200 milliseconds, see the changes that we want to enact on this data. Let's take that dummy message that's coming in, and we'll just pretend it's like these line messages, and we're gonna do a couple of actions on top. So these are all the in-built actions.

They let you do things like add or set keys, block records, delete, parse, split, and even custom actions, which we'll get to. And for this one, I'm going to do something really simple, your most basic processing: I'm going to add a region, and I'm going to call it US East 1. I'm going to add a comment so folks know what I'm doing, add region to messages, and click apply. And the nice thing here is now we can test that. Great, this is what it's going to look like: our log message looks like this, and here's the new field. We can apply, save, and deploy, and that gets pushed out to the pipeline that's there.
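
For intuition, this is a rough sketch of how a small chain of actions like these could be applied to a record in plain Python; the action names, the cleanup step, and the exact region string are illustrative, not the product's actual configuration format.

```python
# Illustrative action chain, loosely mirroring "add key" and "delete key"
# style processing rules; not Chronosphere's actual configuration format.
ACTIONS = [
    ("add_key", {"key": "region", "value": "us-east-1"}),  # the "add region" step from the demo
    ("delete_key", {"key": "debug_info"}),                 # hypothetical cleanup step
]

def apply_actions(record: dict, actions=ACTIONS) -> dict:
    for name, args in actions:
        if name == "add_key":
            record[args["key"]] = args["value"]
        elif name == "delete_key":
            record.pop(args["key"], None)
    return record

print(apply_actions({"log": "GET /health 200", "debug_info": "x"}))
# {'log': 'GET /health 200', 'region': 'us-east-1'}
```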

Looking at the log data

Anurag Gupta: The nice thing about that is the pipeline only has one outbound connection to our management plane. It checks whether I've enacted a new processing rule, pulls that down, and then restarts the pipeline to adhere to that new processing piece. So here we go. Awesome. We're now seeing the new message being pumped out into the console, with my region added in. That's your most basic level of processing: we tested it before we launched it, and now it's launched, it's running, it's in production. Now we can add additional actions on top of this. So, for example, let's go back and grab a quick set of these logs so we can really feel what the transformation is going to look like.

Let's load up these logs again. Excellent. I'm going to just do a bit of a copy-paste, and we'll go back and edit the processing here a little. While I have lines one through ten here, I can switch the input from raw into JSON, add in the messages that we had before, and see live what the impact of some of these rules is going to be. One thing you'll also notice is that when I switched from raw to JSON, it automatically added that log key there. And this is important, too, when we think of processing.

Processing raw data versus normalized data is always a bit of a toss-up, and one thing that we did here is say: it doesn't matter if it's coming in as raw, we will create the most basic level of a schema.


Log is your key, with the value being the raw value that you send us. It doesn't matter if I paste in raw logs, and it doesn't matter if I send raw logs, which we'll do in the other pipeline; we're automatically assigning the most basic level of a schema.
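
A minimal sketch of that idea, assuming JSON input passes through untouched and anything else gets wrapped under a log key:

```python
import json

def normalize(raw_or_json: str) -> dict:
    """Mimic the idea of assigning a minimal schema to raw input:
    JSON objects pass through, anything else is wrapped under a 'log' key."""
    try:
        parsed = json.loads(raw_or_json)
        return parsed if isinstance(parsed, dict) else {"log": raw_or_json}
    except json.JSONDecodeError:
        return {"log": raw_or_json}

print(normalize('{"level": "info"}'))   # {'level': 'info'}
print(normalize('GET /cart 404 12ms'))  # {'log': 'GET /cart 404 12ms'}
```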

Anurag Gupta: Right now I have this blob of data. I knew that the HTTP error codes existed in it, but really, sometimes we don't know what exists within these logs. So, we want to do a little bit more transformation to get a little more understanding of this. To do that, I'll go ahead and choose the source key, which is log, because we're automatically assigning that normalization. For the destination key, I'll choose something like parsed. And then here I can insert my regular expression. I'll just grab one from our parser library in the same playground, paste that, delete the first character and the last character, name it parse HTTP, click apply, and we can run that.
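
Here is a hedged approximation of that parsing step in Python, using an assumed common-log-format regular expression rather than the exact one from the demo: the raw log key is parsed into a nested parsed object.

```python
import re

# Assumed common-log-format pattern; the demo uses a prebuilt expression,
# this one is just a stand-in for illustration.
HTTP_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\d+)'
)

def parse_http(record: dict) -> dict:
    """Parse the raw 'log' key into a nested 'parsed' object."""
    match = HTTP_RE.match(record.get("log", ""))
    if match:
        record["parsed"] = match.groupdict()
    return record

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_http({"log": line})["parsed"]["code"])  # 200
```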

Reducing log data

Anurag Gupta: And now we have key-value pairs. I have a parsed field with a bunch of sub-keys here, like path, user, and code, and I can see how many of these error codes might be 404s, 500s, et cetera. You can see my reduction also went down, from 89% to 74%, but now I have a lot of views and a lot of insight into what's going on with the logs, which is more meaningful than just the blob of text that existed before. Now let's do a little bit more transformation on top of this. Let's do a delete key and get rid of that raw log field: delete raw, apply, run that. And then you'll notice, right, it's still nested under this other field; the top-level key is parsed, with all the fields underneath as sub-values. So, what I'll do on top of this is flatten. Let's go ahead and flatten this record and choose parsed.

And great, now everything is coming in as a flattened record. We're taking an HTTP access log, doing the transformations on top, and deleting the stuff that we don't need. Now we have way more visibility into the keys of what is going on, and yes, we don't have that 89% reduction anymore, but we're still at 86%, so it's still really good compared to what we had before. So, at a basic processing level, we've done some key-value extraction, some deletions, and some flattening. And the best part is we've done this all within this environment; we haven't even deployed anything in a pipeline yet. We can save this as a template and deliver it to the folks who are in charge of the pipeline, whether that's ourselves or someone else.
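
A quick sketch of those two steps (delete the raw key, then flatten the nested parsed object), written as plain Python helpers rather than the product's built-in actions:

```python
def delete_key(record: dict, key: str) -> dict:
    record.pop(key, None)
    return record

def flatten(record: dict, key: str) -> dict:
    """Lift the sub-keys of a nested object up to the top level."""
    nested = record.pop(key, {})
    if isinstance(nested, dict):
        record.update(nested)
    return record

record = {"log": "raw line...", "parsed": {"path": "/cart", "user": "frank", "code": "404"}}
record = delete_key(record, "log")   # drop the raw blob we no longer need
record = flatten(record, "parsed")   # promote path/user/code to the top level
print(record)
# {'path': '/cart', 'user': 'frank', 'code': '404'}
```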

And then you can iterate on it. It'll save the input for you, so you can almost use it as a little bit of a unit test in some ways. Let's go do a bit more advanced processing on this. For the advanced processing here, you'll notice that we have some additional actions, things like aggregate records. Aggregate records will take all the records that come in over a certain amount of time and then allow us to use a select plus some small computations like count and average. So here we can do maybe a slight aggregation: let's go do a select key, and go ahead and load up an example here to get the best way of doing this.

Okay, perfect. So we'll do the select key, and let's select the code. Is it a 400 or is it a 500, and how many of those do we get? Then here we'll put the computation, the count, and how it counts. So we'll do a little bit of the computation and just select the code plus the count. Go ahead and run that. And this is taking place in a time window of 50: it's looking at 50 records and then giving us a summary of those 50 over time. If we want, for example, the entirety of all of these in a single aggregation, we can do that too, just by increasing the time window.

The time window, when you deploy it in a pipeline, uses seconds. In this playground environment, it uses the number of records as the time window, so you can mock that efficiently. So, here we have fifteen 301s, which is really great for seeing what's going on in aggregate. We can use this processing on a specific output or a specific destination. So let's do this: in fact, let's save this. We'll call it Aggregate HTTP and save that. Perfect, that's saved for us. And now you'll notice that in the playground I can also design and build pipelines. So we'll choose a source coming in from, let's call it, OpenTelemetry.
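
To illustrate the aggregation, here is a small sketch that counts records by HTTP code over a window of 50 records, mirroring the playground behavior described above (a deployed pipeline would window by seconds instead). The record shape is assumed.

```python
from collections import Counter
from itertools import islice

def aggregate_by_code(records, window=50):
    """Yield a {code: count} summary for each window of `window` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, window))
        if not batch:
            break
        yield Counter(r.get("code") for r in batch)

sample = [{"code": "301"}] * 15 + [{"code": "404"}] * 5
for summary in aggregate_by_code(sample, window=50):
    print(dict(summary))  # {'301': 15, '404': 5}
```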

And then let’s send this destination out to maybe two or more. We’ll say from here, let’s send it to Chronosphere. Modify this little bit here, get rid of this, and then we’ll also send this over to I don’t know, Clickhouse, if you will. So we can add a Clickhouse post or something like that. I’m just going to fill it in. And maybe for something like Clickhouse, we’re going to add that aggregation processing rule. We’re going to say: “Hey, for this aggregation, we’re going to only use. The same aggregation that we had before, but for the Chronosphere logs, we’re going to use the raw stream of data coming in.” So, you get to choose which both destinations will receive the data. And then we can say: “Hey, for this particular destination, only save, only send 50%. So, doing this upstream has a number of benefits. Number one, one of the most hidden costs of most cloud providers is this egress and data transfer charge. So, if you send data out of your cloud, you’re going to get charged for doing that. And if you’re sending between availability zones, you’re actually getting charged for all of that data as well. So if you can reduce the amount of data at the source level, in your VPC, in your environment, that’s going to help give you savings. 

Catch up on the rest of the chat, followed by a Q&A session at minute 32:30.

On-Demand Demo: Transform Logs to Elevate Observability & Security
