Configuration

Configuration is done via YAML or JSON files or HTTP API resources. Logprep searches for the file /etc/logprep/pipeline.yml if no configuration file is passed.

You can pass multiple configuration files via valid file paths or URLs.

Valid Run Examples
logprep run /different/path/file.yml
logprep run http://url-to-our-yaml-file-or-api
logprep run http://api/v1/pipeline http://api/v1/addition_processor_pipeline /path/to/connector.yaml

Security Best Practice - Configuration - Combining multiple configuration files

When using multiple configuration files, keep in mind that logprep rejects all of them if a single one cannot be retrieved or is not valid. Ensure that all files can be loaded safely and that all endpoints (if using HTTP resources) are accessible.
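
The all-or-nothing behavior described above can be pictured with a small sketch. This is not Logprep's loader: the function name load_all_or_nothing is hypothetical, and it only handles local JSON files to stay within the standard library.

```python
import json
from pathlib import Path


def load_all_or_nothing(paths: list[str]) -> list[dict]:
    """Load every config file, or fail as a whole if a single one is unusable."""
    loaded = []
    for path in paths:
        try:
            loaded.append(json.loads(Path(path).read_text(encoding="utf-8")))
        except (OSError, json.JSONDecodeError) as error:
            # One unreadable or invalid source invalidates the whole set,
            # so no partial configuration is ever returned.
            raise ValueError(f"rejecting all configs because of {path}: {error}") from error
    return loaded
```

A check like this could run before `logprep run` in a deployment script, so that a missing file is reported before the service starts.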

Security Best Practice - Configuration - Authenticity and Integrity

Ensure that all configuration files are retrieved from trusted sources and have not been tampered with. Use TLS to encrypt the transmission of configuration files and use the authentication described in Authentication for HTTP Getters to ensure confidentiality and integrity.
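
For configuration files on disk, one possible additional integrity check is to compare a SHA-256 digest against a value obtained through a separate trusted channel. The sketch below is a hypothetical pre-flight check that runs outside of Logprep; the function names and the idea of a published digest are assumptions, not Logprep features.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_config(path: str, expected_sha256: str) -> bool:
    """Return True if the file matches the digest published by a trusted source."""
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())
```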

Configuration File Structure

Example of a complete configuration file
version: config-1.0
process_count: 2
restart_count: 5
timeout: 5
logger:
    level: INFO
input:
    kafka:
        type: confluentkafka_input
        topic: consumer
        offset_reset_policy: smallest
        kafka_config:
            bootstrap.servers: localhost:9092
            group.id: test
output:
    kafka:
        type: confluentkafka_output
        topic: producer
        flush_timeout: 30
        send_timeout: 2
        kafka_config:
            bootstrap.servers: localhost:9092
pipeline:
- labelername:
    type: labeler
    schema: examples/exampledata/rules/labeler/schema.json
    include_parent_labels: true
    rules:
        - examples/exampledata/rules/labeler/rules

- dissectorname:
    type: dissector
    rules:
        - examples/exampledata/rules/dissector/rules

- dropper:
    type: dropper
    rules:
        - examples/exampledata/rules/dropper/rules
        - filter: "test_dropper"
          dropper:
            drop:
            - drop_me
          description: "..."

- pre_detector:
    type: pre_detector
    rules:
        - examples/exampledata/rules/pre_detector/rules
    outputs:
        - opensearch: sre
    tree_config: examples/exampledata/rules/pre_detector/tree_config.json
    alert_ip_list_path: examples/exampledata/rules/pre_detector/alert_ips.yml

- amides:
    type: amides
    rules:
        - examples/exampledata/rules/amides/rules
    models_path: examples/exampledata/models/model.zip
    num_rule_attributions: 10
    max_cache_entries: 1000000
    decision_threshold: 0.32

- pseudonymizer:
    type: pseudonymizer
    pubkey_analyst: examples/exampledata/rules/pseudonymizer/example_analyst_pub.pem
    pubkey_depseudo: examples/exampledata/rules/pseudonymizer/example_depseudo_pub.pem
    regex_mapping: examples/exampledata/rules/pseudonymizer/regex_mapping.yml
    hash_salt: a_secret_tasty_ingredient
    outputs:
        - opensearch: pseudonyms
    rules:
        - examples/exampledata/rules/pseudonymizer/rules
    max_cached_pseudonyms: 1000000

- calculator:
    type: calculator
    rules:
        - filter: "test_label: execute"
          calculator:
            target_field: "calculation"
            calc: "1 + 1"

The options under input, output and pipeline are passed to factories in Logprep. They contain settings for each separate processor and connector. Details for configuring connectors are described in Output and Input and for processors in Processors.

Environment variables can be used anywhere in configuration and rule files. They have to be set in uppercase and prefixed with LOGPREP_, GITHUB_, PYTEST_ or CI_. Lowercase variables are ignored. The variable name LOGPREP_LIST is forbidden, as it is already used internally.
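
The naming rules can be illustrated with a short sketch. It is not Logprep's implementation: the helper names are hypothetical, only the $NAME form is handled, and leaving unset or ineligible variables untouched is an assumption made for the example.

```python
import re

ALLOWED_PREFIXES = ("LOGPREP_", "GITHUB_", "PYTEST_", "CI_")
FORBIDDEN_NAMES = {"LOGPREP_LIST"}
VARIABLE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def is_substitutable(name: str) -> bool:
    """Return True if the name is uppercase, has an allowed prefix and is not forbidden."""
    return (
        name == name.upper()
        and name.startswith(ALLOWED_PREFIXES)
        and name not in FORBIDDEN_NAMES
    )


def substitute(text: str, environment: dict[str, str]) -> str:
    """Replace $NAME for eligible names that are set; leave everything else as-is."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if is_substitutable(name) and name in environment:
            return environment[name]
        return match.group(0)  # assumption: unknown variables stay unchanged

    return VARIABLE.sub(replace, text)


environment = {"LOGPREP_LOG_LEVEL": "DEBUG", "HOME": "/root"}
print(substitute("level: $LOGPREP_LOG_LEVEL, home: $HOME", environment))
# level: DEBUG, home: $HOME
```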

Security Best Practice - Configuration - Environment Variables

As all configuration options can be replaced with environment variables, it is recommended to use them especially for sensitive information like usernames, passwords, secrets or hash salts. Examples where this is useful are the key for the hmac calculation (see input > preprocessing) or the user/secret for the opensearch connectors.

The following config file will be valid by setting the given environment variables:

pipeline.yml config file with environment variables
version: $LOGPREP_VERSION
process_count: $LOGPREP_PROCESS_COUNT
timeout: 0.1
logger:
    level: $LOGPREP_LOG_LEVEL
$LOGPREP_PIPELINE
$LOGPREP_INPUT
$LOGPREP_OUTPUT

Setting the bash environment variables
export LOGPREP_VERSION="1"
export LOGPREP_PROCESS_COUNT="1"
export LOGPREP_LOG_LEVEL="DEBUG"
export LOGPREP_PIPELINE="
pipeline:
    - labelername:
        type: labeler
        schema: examples/exampledata/rules/labeler/schema.json
        include_parent_labels: true
        rules:
            - examples/exampledata/rules/labeler/rules"
export LOGPREP_OUTPUT="
output:
    kafka:
        type: confluentkafka_output
        topic: producer
        flush_timeout: 30
        send_timeout: 2
        kafka_config:
            bootstrap.servers: localhost:9092"
export LOGPREP_INPUT="
input:
    kafka:
        type: confluentkafka_input
        topic: consumer
        offset_reset_policy: smallest
        kafka_config:
            bootstrap.servers: localhost:9092
            group.id: test"

class logprep.util.configuration.Configuration

The configuration class.

version: str

Optionally, a version can be set in the configuration file, which can be printed via logprep run --version config/pipeline.yml. This has no effect on the execution of logprep but is used as a hook for reloading the configuration. Defaults to unset.

config_refresh_interval: int | None

Configures the interval in seconds at which logprep tries to reload the configuration. If not configured, logprep won’t reload the configuration automatically. If configured, the configuration is only reloaded if the configuration version changes. If HTTP errors occur on configuration reload, config_refresh_interval is set to a quarter of the current config_refresh_interval until a minimum of 5 seconds is reached. Defaults to None, which means that the configuration will not be refreshed.
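
The reduction of the interval on HTTP errors and the version check can be sketched as follows. The function names are hypothetical, and the use of integer division for "a quarter" is an assumption, since the rounding is not specified here.

```python
MINIMUM_INTERVAL = 5  # seconds, the lower bound stated above


def reduced_interval(current: int) -> int:
    """Quarter the interval after a failed reload, but never go below the minimum."""
    return max(MINIMUM_INTERVAL, current // 4)  # assumption: integer division


def should_reload(active_version: str, fetched_version: str) -> bool:
    """A fetched configuration is only applied if its version differs."""
    return fetched_version != active_version


interval = 300
steps = [interval]
while interval > MINIMUM_INTERVAL:
    interval = reduced_interval(interval)
    steps.append(interval)
print(steps)  # [300, 75, 18, 5]
```

Starting from 300 seconds, three consecutive failures are enough under these assumptions to reach the 5 second minimum.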

Security Best Practice - Configuration - Refresh Interval

The refresh interval for the configuration shouldn’t be set too high in production environments. It is suggested to not set a value higher than 300 (5 min). That way configuration updates are propagated fairly quickly instead of once a day.

It should also be noted that a new configuration file will be read as long as it is a valid config. There is no further check to ensure credibility.

If a new configuration cannot be retrieved and config_refresh_interval has already been reduced automatically to 5 seconds, logprep retries the reload frequently, which can lead to blocking behavior or a significant reduction in performance. Therefore, ensure that the configuration endpoint is always available.

process_count: int

Number of logprep processes to start. Defaults to 1.

restart_count: int

Number of restarts before logprep exits. Defaults to 5. If this value is set to a negative number, logprep will always restart immediately.

Security Best Practice - Configuration - Restart Counter

The restart counter should be set to a value greater than 0 to ensure that logprep exits gracefully in case of repeated failures. This ensures that resources are released properly and any necessary cleanup is performed. Additionally, the process will exit with a non-zero exit code to indicate that an error occurred. This is especially useful if you use an external orchestrator like k8s or systemd to manage the logprep process, so that you get notified about failures via their respective monitoring and alerting systems.
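
The semantics of restart_count can be pictured with a minimal supervisor loop. This is a sketch under assumptions, not Logprep's runner: the names may_restart and supervise are hypothetical, and the exact point at which the counter is compared is a guess.

```python
from typing import Callable


def may_restart(restarts_done: int, restart_count: int) -> bool:
    """Negative values mean unlimited restarts; otherwise the budget is finite."""
    if restart_count < 0:
        return True
    return restarts_done < restart_count


def supervise(run_once: Callable[[], None], restart_count: int = 5) -> int:
    """Run until success or until the restart budget is used up; return an exit code."""
    restarts_done = 0
    while True:
        try:
            run_once()
            return 0
        except Exception:
            if not may_restart(restarts_done, restart_count):
                return 1  # non-zero exit code signals the failure to the orchestrator
            restarts_done += 1
```

With a finite budget, an orchestrator such as systemd or k8s sees the non-zero exit code and can alert, whereas a negative value in this sketch would keep the loop running forever on a persistent error.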

timeout: float

Logprep tries to react to signals (such as the one sent by CTRL+C) within the given time. The time taken for some processing steps is not always predictable, thus it is not possible to ensure that this time will be adhered to. Logprep reacts quickly for small values (< 1.0), but this requires more processing power. This can be useful for testing and debugging. Larger values (like 5.0) slow the reaction time down but require less processing power, which makes them preferable for continuous operation. Defaults to 5.0.

logger: LoggerConfig

Logger configuration.

class LoggerConfig

The logger config class used in Configuration. The schema for this class is derived from the python logging module: https://docs.python.org/3/library/logging.config.html#dictionary-schema-details

LoggerConfig.level: str

The log level of the root logger. Defaults to INFO.

Security Best Practice - Configuration - Log-Level

The log level of the root logger should be set to INFO or higher in production environments to avoid exposing sensitive information in the logs.

LoggerConfig.format: str

The format of the log message as supported by the LogprepFormatter. Defaults to "%(asctime)-15s %(name)-10s %(levelname)-8s: %(message)s".

class LogprepFormatter

A custom formatter for logprep logging with additional attributes.

The Formatter can be initialized with a format string which makes use of knowledge of the LogRecord attributes - e.g. the default value mentioned above makes use of the fact that the user’s message and arguments are preformatted into a LogRecord’s message attribute. The available attributes are listed in the python documentation. Additionally, the formatter provides the following logprep specific attributes:

attribute      description
%(hostname)    (Logprep specific) The hostname of the machine where the log was emitted
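
How a formatter can expose an extra attribute such as %(hostname) is sketched below with the standard logging module. The class HostnameFormatter is a hypothetical stand-in written for this example and is not the LogprepFormatter source.

```python
import logging
import socket


class HostnameFormatter(logging.Formatter):
    """Hypothetical formatter that makes %(hostname)s usable in format strings."""

    def format(self, record: logging.LogRecord) -> str:
        # Attach the hostname to the record before the format string is applied.
        record.hostname = socket.gethostname()
        return super().format(record)


formatter = HostnameFormatter(
    fmt="%(asctime)-15s %(hostname)-5s %(name)-10s %(levelname)-8s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
record = logging.LogRecord("Runner", logging.INFO, "example.py", 0, "started", None, None)
line = formatter.format(record)
print(line)
```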

LoggerConfig.datefmt: str

The date format of the log message. Defaults to "%Y-%m-%d %H:%M:%S".

LoggerConfig.loggers: dict

The loggers loglevel configuration. Defaults to:

root                      INFO
filelock                  ERROR
urllib3.connectionpool    ERROR
opensearch                ERROR
uvicorn                   INFO
uvicorn.access            INFO
uvicorn.error             INFO

You can alter the log level of the loggers by adding them to the loggers mapping as in the example below. Logprep opts out of hierarchical loggers, so it is possible to set the general log level for all loggers via the root logger to INFO and then set the log level for specific loggers like Runner to DEBUG to get only DEBUG messages from the Runner instance.

If you want to silence other loggers like py.warnings you can set the log level to ERROR here.

Example of a custom logger configuration
logger:
    level: ERROR
    format: "%(asctime)-15s %(hostname)-5s %(name)-10s %(levelname)-8s: %(message)s"
    datefmt: "%Y-%m-%d %H:%M:%S"
    loggers:
        "py.warnings": {"level": "ERROR"}
        "Runner": {"level": "DEBUG"}

Note

The effective log level of the root logger is controlled via logger.level. By default, logger.level is set to INFO if not configured explicitly. A value configured under loggers.root.level is currently ignored for the root logger, because it will always be overwritten by logger.level. Providing loggers.root.level therefore has no effect (except for triggering a warning during startup).
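
Since the logger schema is derived from the Python logging dictionary schema, the custom logger configuration above roughly corresponds to the following plain logging.config.dictConfig call. This is a sketch using only the standard library, not Logprep's setup code: the handler and formatter names are made up, and the %(hostname) attribute is left out because it is only provided by the LogprepFormatter.

```python
import logging
import logging.config

logging.config.dictConfig(
    {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "plain": {
                "format": "%(asctime)-15s %(name)-10s %(levelname)-8s: %(message)s",
                "datefmt": "%Y-%m-%d %H:%M:%S",
            }
        },
        "handlers": {
            "console": {"class": "logging.StreamHandler", "formatter": "plain"}
        },
        # Corresponds to logger.level in the Logprep configuration.
        "root": {"level": "ERROR", "handlers": ["console"]},
        # Corresponds to the logger.loggers mapping.
        "loggers": {
            "py.warnings": {"level": "ERROR"},
            "Runner": {"level": "DEBUG"},
        },
    }
)

logging.getLogger("Runner").debug("emitted, because Runner is set to DEBUG")
logging.getLogger("SomethingElse").info("suppressed, because the root level is ERROR")
```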

input: dict

Input connector configuration. Defaults to {}. For detailed configurations see Input.

output: dict

Output connector configuration. Defaults to {}. For detailed configurations see Output.

error_output: dict

Error output connector configuration. Defaults to {}. This is optional. If no error output is configured, logprep will not handle events that could not be processed by the pipeline, parsed correctly by input connectors or stored correctly by output connectors. For detailed configurations see Output.

pipeline: list[dict]

Pipeline configuration. Defaults to []. See Processors for a detailed overview on how to configure a pipeline.

metrics: MetricsConfig

Metrics configuration. Defaults to {"enabled": False, "port": 8000, "uvicorn_config": {}}.

The key uvicorn_config can be configured with any uvicorn config parameters. For further information see the uvicorn documentation.

Security Best Practice - Configuration - Metrics Configuration

In addition to the settings below, it is recommended to configure SSL on the metrics server endpoint.

metrics:
  enabled: true
  port: 9000
  uvicorn_config:
    access_log: true
    server_header: false
    date_header: false
    workers: 1

profile_pipelines: bool

Start the profiler to profile the pipeline. Defaults to False. This can be used to profile logprep in near production environments to inspect performance bottlenecks.

error_backlog_size: int

Size of the error backlog. Defaults to 15000.

Security Best Practice - Configuration - Error Backlog Size

Depending on your environment, ensure that this value adheres to your overall system resource limits. If the backlog grows too large in failure situations, it can lead to OOM (Out Of Memory) errors. Reserve memory for this backlog so that an attacker cannot cause a DoS (Denial of Service) by sending failing logs.
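
A rough upper bound for the memory to reserve can be estimated from the backlog size and an average event size. The average of 2048 bytes per event below is purely an assumed figure for illustration; measure your own events to get a realistic number.

```python
def backlog_memory_bytes(error_backlog_size: int, average_event_bytes: int) -> int:
    """Estimate the memory held by a completely filled error backlog."""
    return error_backlog_size * average_event_bytes


# Default backlog size of 15000 with an assumed average event size of 2 KiB.
worst_case = backlog_memory_bytes(15000, 2048)
print(f"{worst_case / 1024**2:.1f} MiB")  # 29.3 MiB
```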