Avoiding data loss and backpressure problems with Fluent Bit


A practical guide on how to detect and avoid backpressure problems with Fluent Bit by balancing memory-based and filesystem-based buffering.

Sharad Regoti | Guest Author

Sharad Regoti is a CKA & CKS certified software engineer based in Mumbai.


Understanding backpressure

Fluent Bit is a widely used open-source data collection agent, processor, and forwarder that enables you to collect logs, metrics, and traces from various sources, filter and transform them, and then forward them to multiple destinations. With over ten billion Docker pulls, Fluent Bit has established itself as a preferred choice for log processing, collecting, and shipping.

At its core, Fluent Bit is a simple data pipeline consisting of various stages, as depicted below.

Diagram: the Fluent Bit data pipeline, from input through parser, filter, and buffer (where backpressure is managed) to routing and multiple outputs.

Most data pipelines eventually suffer from backpressure, which occurs when data is ingested faster than it can be flushed to its destinations. Network failures, latency, and third-party service outages are common causes. The resulting problems include high memory usage, service downtime, and data loss, all of which we would like to avoid.

Fluent Bit offers special strategies to deal with backpressure to help ensure data safety and reduce downtime. Recognizing when Fluent Bit is experiencing backpressure and knowing how to address it is crucial for maintaining a healthy data pipeline.

This post provides a practical guide on how to detect and avoid backpressure problems with Fluent Bit.

Prerequisites

  • Docker installed on your local machine.
  • Familiarity with Fluent Bit concepts such as inputs, outputs, parsers, and filters. If you’re not familiar with these concepts, please refer to the official documentation.

Fluent Bit’s default: memory-based buffering

In Fluent Bit, records are emitted by an input plugin. These records are then grouped together by the engine into a unit called a Chunk, which typically has a size of around 2MB. Based on the configuration, the engine determines where to store these Chunks. By default, all Chunks are created in memory.

With the default mechanism, Fluent Bit stores as much data in memory as possible. This is the fastest approach with the least system overhead, but in certain scenarios data can be ingested faster than it can be flushed to some destinations. This generates backpressure, leading to high memory consumption in the service.
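The input property that controls where Chunks are buffered is storage.type, which defaults to memory. The minimal sketch below simply spells out that default explicitly:

[INPUT]
  Name         dummy
  tag          dummy-a
  # storage.type defaults to memory; set explicitly here only for clarity
  storage.type memory

[OUTPUT]
  name   stdout
  match  *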

In a high-load environment with backpressure, there’s a risk of increased memory usage, which can lead to the kernel terminating the Fluent Bit process.

Let’s create some backpressure

Let’s demonstrate a situation in which Fluent Bit is running in a constrained environment. Our goal is to create enough backpressure and memory usage that the Fluent Bit service is killed by the kernel’s out-of-memory (OOM) killer.

Clone the samples git repository

This repository contains all the required configuration files. Use the command below to clone the repository.

git clone git@github.com:chronosphereio/calyptia-blog-posts.git
cd calyptia-blog-posts

Evaluate the default RAM consumption

docker run -v $(pwd)/fluent-bit-empty.conf:/fluent-bit/etc/fluent-bit.conf:ro -ti cr.fluentbit.io/fluent/fluent-bit:2.2

The above command runs Fluent Bit in a Docker container with an empty configuration file; this lets us evaluate the default memory consumption and memory limits for the container.

Use the docker ps command to get the name or container ID of the newly created container and use that value in the below command to get its stats.

docker stats <container-id-or-container-name>

The image below shows that by default, the container consumes 10MB of RAM and can take up to 8GB of system RAM.

Screenshot: docker stats output for the container, with annotations highlighting its memory usage and the default 8GB memory limit.

Create the Fluent Bit configuration

[INPUT]
  Name   dummy
  copies 1500
  dummy  {"host":"31.163.219.152"...} # A large JSON object, refer git repository
  tag    dummy-a

[OUTPUT]
  name   stdout
  match  *

The above configuration uses the dummy input plugin to generate large amounts of data for test purposes. We have configured it to generate 1500 records per second. This data is then printed to the console using the stdout output plugin.

Simulate backpressure

To simulate OOM kill behavior caused by backpressure, we will generate data at a higher rate while restricting the container RAM to just 20MB. With this configuration, as Fluent Bit tries to buffer more data in memory, it will eventually hit the container’s imposed RAM limit, and the service will crash.

Execute the below command and observe the result.

docker run --memory 20MB -v $(pwd)/fluent-bit-oom.conf:/fluent-bit/etc/fluent-bit.conf:ro -ti cr.fluentbit.io/fluent/fluent-bit:2.2

After a few seconds, the container will stop automatically. Once it stops, grab the container ID using docker ps -a and inspect the container. You should observe that it was killed due to a container out-of-memory error.
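A quick way to confirm this, using Docker’s standard container state fields, is:

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>

An output of true (typically with exit code 137, i.e. SIGKILL) indicates the kernel’s OOM killer terminated the process.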

Screenshot: docker stats output for the container running with the imposed RAM limit.

Screenshot: docker inspect output showing "OOMKilled": true, confirming the container was killed due to running out of memory.

This demonstrates how backpressure in Fluent Bit leads to increased memory usage. Upon reaching memory limits, the kernel terminates the application, which could cause downtime and data loss.

If your Fluent Bit process is continuously getting killed, it is likely an indication that Fluent Bit is experiencing backpressure. In the following section, we’ll explore a solution to this problem.

A quick fix: limiting memory-based buffering

A workaround for backpressure scenarios like the one above is to limit the amount of memory an input plugin can use for buffered records. This is configured with the mem_buf_limit property. If a plugin has enqueued more data than its mem_buf_limit allows, it won’t be able to ingest more until that buffered data is delivered.

When the configured limit is reached, the input plugin is paused, halting record ingestion until it is resumed, which can lead to data loss.

When an input plugin is paused or resumed, Fluent Bit logs the event to the console, as in the example below:

[input] tail.1 paused (mem buf overlimit)
[input] tail.1 resume (mem buf overlimit)

The mem_buf_limit workaround is useful for certain scenarios and environments. It helps control the service’s memory usage, but at the cost of data loss, and this can happen with any input plugin.

The goal of mem_buf_limit is memory control and survival of the service. Let’s see what happens when we modify our Fluent Bit configuration by adding the mem_buf_limit property to our input plugin.

[INPUT]
  Name   dummy
  copies 1500
  dummy  {"host":"31.163.219.152"...} # A large JSON object, refer git repository
  tag   dummy-a
  Mem_Buf_Limit 10MB

[OUTPUT]
  name               stdout
  match              *

We’ve set a 20MB memory limit for the container. With Fluent Bit using 10MB in its default configuration, we allocated an additional 10MB as mem_buf_limit.

Note: The combined memory limits assigned to the input plugins must stay below the resource restriction placed on the container.
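For example, a sketch with a hypothetical second dummy input (added only for illustration) shows how the budget could be split between two inputs in the same 20MB container, with the limits still summing to 10MB:

[INPUT]
  Name          dummy
  tag           dummy-a
  Mem_Buf_Limit 5MB

[INPUT]
  Name          dummy
  tag           dummy-b
  Mem_Buf_Limit 5MB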

Execute the below command and observe the result.

docker run --memory 20MB -v $(pwd)/fluent-bit-memory-limit.conf:/fluent-bit/etc/fluent-bit.conf:ro -ti cr.fluentbit.io/fluent/fluent-bit:2.2

Unlike the previous scenario, the container is not killed and continues to emit dummy records on the console. Grab the container ID using docker ps and execute the below command.

docker logs <container-id> | grep -i "paused\|resume"

Screenshot: docker logs output with repeated timestamped entries of the input plugin being paused and resumed due to the memory buffer limit, a symptom of backpressure.

The above image indicates that as Fluent Bit reaches the 10MB buffer limit of the input plugin, it pauses ingesting new records, potentially leading to data loss. However, this pause prevents the service from being terminated due to high memory usage. Upon buffer clearance, the ingestion of new records resumes.

If you see logs like these, it is a sign that Fluent Bit is hitting the configured memory limits on its input plugins due to backpressure. Check out this blog post on configuring alerts from logs using Fluent Bit.

In the next section, we will see how to achieve both data safety and memory safety.

A permanent fix: filesystem-based buffering

Filesystem buffering provides control over backpressure and can help guarantee data safety. Memory and filesystem buffering approaches are not mutually exclusive. When filesystem buffering is enabled for your input plugin, you are getting the best of both worlds: performance and data safety.

When filesystem buffering is enabled, the behavior of the engine is different. Upon Chunk creation, the engine stores the content in memory and also maps a copy on disk (through mmap(2)). The newly created Chunk is thus (1) active in memory and (2) backed up on disk.

How does the filesystem buffering mechanism deal with high memory usage and backpressure?

Fluent Bit controls the number of Chunks that are up in memory. By default, the engine allows 128 Chunks to be up in memory; this value is controlled by the service property storage.max_chunks_up. The active Chunks that are up are ready for delivery.

Any other Chunk is in a down state, meaning it exists only in the filesystem and won’t be brought up into memory until it is ready to be delivered. Remember, Chunks are never much larger than 2MB; thus, with the default storage.max_chunks_up value of 128, each input is limited to roughly 256MB of memory.

If an input plugin has storage.type set to filesystem, then upon reaching the storage.max_chunks_up threshold the plugin is not paused; instead, all new data goes to Chunks that are down in the filesystem. This allows us to control the service’s memory usage and also guarantees that the service won’t lose any data.
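If you want to observe this behavior at runtime, Fluent Bit can expose storage-layer metrics through its built-in HTTP server. The sketch below assumes a local instance listening on the default port 2020; the http_server and storage.metrics settings would be merged into the existing [SERVICE] section.

[SERVICE]
  http_server     On
  http_listen     0.0.0.0
  http_port       2020
  storage.metrics On

With this enabled, querying the storage endpoint reports how many Chunks are up or down, overall and per input:

curl -s http://127.0.0.1:2020/api/v1/storage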

Configuring filesystem-based buffering

Let’s modify our Fluent Bit configuration by enabling filesystem buffering.

[SERVICE]
  flush                 1
  log_level             info
  storage.path          /var/log/flb-storage/
  storage.sync          normal
  storage.checksum      off
  storage.max_chunks_up 5

[INPUT]
  Name   dummy
  copies 1500
  dummy  {"host":"31.163.219.152"...} # A large JSON object, refer git repository
  tag   dummy-a
  storage.type filesystem

[OUTPUT]
  name               stdout
  match              *

In the above configuration, we have added storage.type filesystem to our input plugin and a [SERVICE] block that sets storage.max_chunks_up to 5 (~10MB).

Execute the below command and observe the result.

docker run --memory 20MB -v $(pwd)/fluent-bit-filesystem.conf:/fluent-bit/etc/fluent-bit.conf:ro -ti cr.fluentbit.io/fluent/fluent-bit:2.2

Unlike the default scenario, the container does not crash and continues to emit dummy records on the console. When the storage.max_chunks_up limit is reached, new Chunks stay down in the filesystem and are brought up and delivered once memory frees up.

Note: While filesystem-based buffering helps prevent container crashes due to backpressure, it introduces new considerations related to filesystem limits. Just as memory can be exhausted, so can filesystem storage. When opting for filesystem-based buffering, it is essential to have a plan that addresses potential filesystem-related challenges.
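One relevant knob is the per-output storage.total_limit_size property, which caps the disk space used by Chunks queued for a given destination. When the limit is reached, the oldest Chunks are discarded to make room for new data, so it should be sized with that trade-off in mind. A minimal sketch:

[OUTPUT]
  name                     stdout
  match                    *
  storage.total_limit_size 500M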

Conclusion

This guide has explored effective strategies to manage backpressure and prevent data loss in Fluent Bit. We’ve highlighted the limitations of default memory-based buffering and how Mem_Buf_Limit is a quick fix to balance memory usage. The ultimate solution, filesystem-based buffering, offers a comprehensive approach, ensuring data safety and efficient memory management. These techniques are essential for optimizing Fluent Bit in high-throughput environments, ensuring robust and reliable log processing.

Meme: a hand balances a dagger labeled “Memory Safety,” “Backpressure,” and “Data Loss,” alongside the Fluent Bit logo and the caption “Perfectly balanced… as all things should be.”

For more information on this topic, refer to the official Fluent Bit documentation on buffering and storage.

Get started with Fluent Bit Academy


To continue expanding your Fluent Bit knowledge, check out Fluent Bit Academy. It’s filled with on-demand videos guiding you through all things Fluent Bit: best practices and how-tos on advanced processing rules, routing to multiple destinations, and much more. Here’s a sample of what you can find there:

  • Getting Started with Fluent Bit and OpenSearch
  • Getting Started with Fluent Bit and OpenTelemetry
  • Fluent Bit for Windows

 

About Fluent Bit and Chronosphere

With Chronosphere’s acquisition of Calyptia in 2024, Chronosphere became the primary corporate sponsor of Fluent Bit. Eduardo Silva — the original creator of Fluent Bit and co-founder of Calyptia — leads a team of Chronosphere engineers dedicated full-time to the project, ensuring its continuous development and improvement.

Fluent Bit is a graduated project of the Cloud Native Computing Foundation (CNCF) under the umbrella of Fluentd, alongside other foundational technologies such as Kubernetes and Prometheus. Chronosphere is also a silver-level sponsor of the CNCF.
