StringSplitter

This presentation’s goal it to introduce the features of the StringSplitter and how to configure it.

The challenges

  • I want to split strings of varying length contained in a source field

given preprocessed log entry:

[9]:
document = {
    "ip_addresses": "192.168.5.1, 10.10.2.1, fe80::, 127.0.0.1"
}

Create rules and processor

create the rules:

[10]:
import sys
sys.path.append("../../../../../")

from logprep.processor.string_splitter.rule import StringSplitterRule
rules_definitions = [
    {
        "filter": "ip_addresses",
        "string_splitter": {
            "source_fields": ["ip_addresses"],
            "target_field": "ip_address_list"
        },
    }
]
rules = [StringSplitterRule.create_from_dict(rule_dict) for rule_dict in rules_definitions]
rules
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[10], line 4
      1 import sys
      2 # sys.path.append("../../../../../")
----> 4 from logprep.processor.string_splitter.rule import StringSplitterRule
      5 rules_definitions = [
      6     {
      7         "filter": "ip_addresses",
   (...)
     12     }
     13 ]
     14 rules = [StringSplitterRule._create_from_dict(rule_dict) for rule_dict in rules_definitions]

ModuleNotFoundError: No module named 'logprep.processor.string_splitter'

create the processor config:

[ ]:
processor_config = {
    "allmighty_string_splitter": {
        "type": "string_splitter",
        "rules": ["/dev"],
    }
}

create the processor with the factory:

[ ]:
from logging import getLogger
from logprep.factory import Factory

logger = getLogger()

processor = Factory.create(processor_config)
processor

string_splitter

load rules to processor

[ ]:
for rule in rules:
    processor._rule_tree.add_rule(rule)

processor.rules
[filter="ip_addresses", StringSplitterRule.Config(description='', regex_fields=[], tests=[], tag_on_failure=['_string_splitter_failure'], source_fields=['ip_addresses'], target_field='ip_addresses', delete_source_fields=False, overwrite_target=True, extend_target_list=False, delimeter=' ')]

Process event

[ ]:
from copy import deepcopy

mydocument = deepcopy(document)
processor.process(mydocument)

Check Results

[ ]:
document
{'ip_addresses': '192.168.5.1, 10.10.2.1, fe80::, 127.0.0.1'}
[ ]:
mydocument
{'ip_addresses': ['192.168.5.1,', '10.10.2.1,', 'fe80::,', '127.0.0.1']}