Profiling & Benchmarking
How-to: Profiling Memory Usage in Pipelie Mode
The Python ecosystem offers a variety of tools for profiling memory usage. For the logprep project, additional requirements include support for multiprocessing, multithreading, and compiled extension modules. Memray is a modern, open-source memory profiling tool that meets these requirements. It is easy to use, performant, and supports all the necessary features.
This guide describes how to profile logprep running in pipeline mode
between Kafka and OpenSearch.
While Kafka, OpenSearch, and other components are deployed using Docker containers
with ports exposed on the host,
logprep runs directly on the host using memray run as the entry point.
The following diagram illustrates the high-level setup:
—
Install Dependencies
Ensure all required dependencies are installed on the host:
Memray: Install using
pip install memrayin the virtual environment where logprep runs.Logprep: Install using
pip install . && pip install ".[dev]"in the same virtual environment.Docker / Podman: Refer to the respective docs for detailed instructions.
Prepare the Execution Environment
Start the Docker containers and wait for them to be fully operational:
docker compose -f examples/compose/docker-compose.yml up -d
Prepare a Test Scenario
Set up a logprep configuration and test inputs to trigger the desired functionality. Simplify the setup to minimize noise in the system.
Logprep Configuration
Use
examples/exampledata/config/pipeline.ymlas a starting point for profiling Kafka input, basic processing, and OpenSearch output.Set
process_countto 1 and reduce backlog sizes to further simplify the setup.
Log Data
Use
examples/exampledata/input_logdata/logclass/test_input.jsonlas a starting point for generating log events.Make adequate modifications to focus on the desired functionality.
Configure Memray
Select the appropriate Memray configuration options based on the code being profiled (main process, subprocesses, or native code). For this guide, we use the following options to track potential memory issues in the pipeline process, including native code:
--follow-fork--trace-python-allocators--native
Additional CLI Options
--follow-forkTo follow forked subprocesses. https://bloomberg.github.io/memray/run.html#tracking-across-forks
--trace-python-allocatorsTo track individual object allocations (more data, slower, helpful for tracking memory leaks). https://bloomberg.github.io/memray/run.html#python-allocator-tracking
--nativeTo track allocations in extension modules as well (more data, slower). https://bloomberg.github.io/memray/run.html#native-tracking
--aggregateTo only store aggregated data, speeding up analysis (but only after program termination). https://bloomberg.github.io/memray/run.html#aggregated-capture-files
--liveTo display a live allocation screen. https://bloomberg.github.io/memray/run.html#live-tracking
Run the Experiment
The experiment requires multiple terminals to manage parallel processes.
Terminal 1: Run the Profiler
# Prepare a simple configuration to profile relevant code sections.
PIPELINE_CONFIG=examples/exampledata/config/pipeline.yml
# Run the profiler
memray run --follow-fork --trace-python-allocators --native logprep/run_logprep.py run $PIPELINE_CONFIG
# Identify the runner PID ($RUNNER) and, if applicable, the PID of the subprocess ($SUBPROCESS) from the logs.
# Alternatively, pipe the output to a file and search for "Pipeline" to find the subprocess PID.
Terminal 2: Generate Continuous Reports
# For the main process
DUMP_FILE=logprep/memray-run_logprep.py.$RUNNER.bin
# For a subprocess
DUMP_FILE=logprep/memray-run_logprep.py.$RUNNER.bin.$SUBPROCESS
while true; do
TIMESTAMP=$(date +%s)
REPORT_FILE="${DUMP_FILE}-${TIMESTAMP}.html"
memray flamegraph $DUMP_FILE --leaks -o $REPORT_FILE --force && open $REPORT_FILE
# ^ generated reports are opened automatically in the browser
done
Terminal 3: Continuously Produce Test Input
# Prepare log messages to trigger the desired functionality.
KAFKA_TEST_INPUT=examples/exampledata/input_logdata/logclass/test_input.jsonl
while true; do
docker exec -i kafka kafka-console-producer.sh --bootstrap-server 127.0.0.1:9092 --topic consumer < $KAFKA_TEST_INPUT
done
Note
Memray supports multiple report formats. For more details, see the Memray Flamegraph Documentation.