Profiling & Benchmarking

How-to: Profiling Memory Usage in Pipelie Mode

The Python ecosystem offers a variety of tools for profiling memory usage. For the logprep project, additional requirements include support for multiprocessing, multithreading, and compiled extension modules. Memray is a modern, open-source memory profiling tool that meets these requirements. It is easy to use, performant, and supports all the necessary features.

This guide describes how to profile logprep running in pipeline mode between Kafka and OpenSearch. While Kafka, OpenSearch, and other components are deployed using Docker containers with ports exposed on the host, logprep runs directly on the host using memray run as the entry point.

The following diagram illustrates the high-level setup:

memory-profiling-setup.drawio.html

—

Install Dependencies

Ensure all required dependencies are installed on the host:

Memray: Install using pip install memray in the virtual environment where logprep runs.
Logprep: Install using pip install . && pip install ".[dev]" in the same virtual environment.
Docker / Podman: Refer to the respective docs for detailed instructions.

Prepare the Execution Environment

Start the Docker containers and wait for them to be fully operational:

docker compose -f examples/compose/docker-compose.yml up -d

Prepare a Test Scenario

Set up a logprep configuration and test inputs to trigger the desired functionality. Simplify the setup to minimize noise in the system.

Logprep Configuration

Use examples/exampledata/config/pipeline.yml as a starting point for profiling Kafka input, basic processing, and OpenSearch output.
Set process_count to 1 and reduce backlog sizes to further simplify the setup.

Log Data

Use examples/exampledata/input_logdata/logclass/test_input.jsonl as a starting point for generating log events.
Make adequate modifications to focus on the desired functionality.

Configure Memray

Select the appropriate Memray configuration options based on the code being profiled (main process, subprocesses, or native code). For this guide, we use the following options to track potential memory issues in the pipeline process, including native code:

--follow-fork
--trace-python-allocators
--native

Additional CLI Options

--follow-fork: To follow forked subprocesses. https://bloomberg.github.io/memray/run.html#tracking-across-forks
--trace-python-allocators: To track individual object allocations (more data, slower, helpful for tracking memory leaks). https://bloomberg.github.io/memray/run.html#python-allocator-tracking
--native: To track allocations in extension modules as well (more data, slower). https://bloomberg.github.io/memray/run.html#native-tracking
--aggregate: To only store aggregated data, speeding up analysis (but only after program termination). https://bloomberg.github.io/memray/run.html#aggregated-capture-files
--live: To display a live allocation screen. https://bloomberg.github.io/memray/run.html#live-tracking

Run the Experiment

The experiment requires multiple terminals to manage parallel processes.

Terminal 1: Run the Profiler

# Prepare a simple configuration to profile relevant code sections.
PIPELINE_CONFIG=examples/exampledata/config/pipeline.yml

# Run the profiler
memray run --follow-fork --trace-python-allocators --native logprep/run_logprep.py run $PIPELINE_CONFIG

# Identify the runner PID ($RUNNER) and, if applicable, the PID of the subprocess ($SUBPROCESS) from the logs.
# Alternatively, pipe the output to a file and search for "Pipeline" to find the subprocess PID.

Terminal 2: Generate Continuous Reports

# For the main process
DUMP_FILE=logprep/memray-run_logprep.py.$RUNNER.bin

# For a subprocess
DUMP_FILE=logprep/memray-run_logprep.py.$RUNNER.bin.$SUBPROCESS

while true; do
   TIMESTAMP=$(date +%s)
   REPORT_FILE="${DUMP_FILE}-${TIMESTAMP}.html"
   memray flamegraph $DUMP_FILE --leaks -o $REPORT_FILE --force && open $REPORT_FILE
   # ^ generated reports are opened automatically in the browser
done

Terminal 3: Continuously Produce Test Input

# Prepare log messages to trigger the desired functionality.
KAFKA_TEST_INPUT=examples/exampledata/input_logdata/logclass/test_input.jsonl

while true; do
   docker exec -i kafka kafka-console-producer.sh --bootstrap-server 127.0.0.1:9092 --topic consumer < $KAFKA_TEST_INPUT
done

Note

Memray supports multiple report formats. For more details, see the Memray Flamegraph Documentation.