LEX (Log EXtractor) Documentation - opelbn/SPIF GitHub Wiki
LEX transforms Zeek, GCP, and NetFlow logs into structured CSV data for analysis or machine learning, with flexible field extraction and suspicion labeling—all in a fast, parallelized workflow. As part of the Simplified Procedural Inference Framework (SPIF), it bridges raw log data to downstream preprocessing and model training, making it ideal for network security analysts, IoT researchers, and cloud auditors.
- Supported Log Types: Processes Zeek (`.log`, `.labeled`), GCP (`.jsonl`), and NetFlow (`.csv`) logs.
- Field Profiling: Discovers available fields dynamically from input logs.
- Field Extraction: Extracts user-specified fields into CSV, defaulting to all fields if none are selected.
- Suspicion Labeling: Tags entries as suspicious (`1`) or normal (`0`) using default or custom JSON profiles.
- No-Label Mode: Extracts fields without applying suspicion labels (`--no-label`).
- Parallel Processing: Speeds up large datasets with multi-threaded file handling.
- NetFlow Aggregation: Aggregates stats by pair, source, or port within each file.
- C++17 Compiler: GCC, Clang, or MSVC (with threading support).
- Dependencies: the `cxxopts.hpp` and `json.hpp` headers.
- SPIF Context: Part of the SPIF repo; see SPIF Installation for full setup.
- Place `cxxopts.hpp` and `json.hpp` in the project directory.
- Compile: `g++ -std=c++17 -I. main.cpp logslice.cpp -o LEX -pthread`
- Output: `LEX` (Linux) or `LEX.exe` (Windows).
LEX [options]
| Option | Description | Default | Example |
|---|---|---|---|
| `-i, --input` | Input directory or file (required) | N/A | `-i logs/zeek` |
| `-o, --output` | Output directory or CSV file | N/A | `-o output.csv` |
| Option | Description | Default | Example |
|---|---|---|---|
| `-t, --log-type` | Log type (`zeek`, `gcp`, `netflow`) | `zeek` | `-t gcp` |
| `-a, --agg-type` | NetFlow aggregation (`pair`, `source`, `port`) | `pair` | `-a port` |
| `-f, --profile-file` | Suspicion profile JSON file | None | `-f profile.json` |
| `-p, --profile` | Profile fields only, no processing | False | `-p` |
| `-n, --no-label` | Extract fields without labeling | False | `-n` |
| `-h, --help` | Display help message | N/A | `-h` |
| Option | Description | Default | Example |
|---|---|---|---|
| `--<field>` | Extract specific field (case-sensitive) | All fields | `--ts --duration` |
- Note: Available fields depend on the log type and input data. Use `-p` to list them (e.g., `ts`, `protoPayload.methodName`, `srcaddr`). Fields whose names collide with reserved options (e.g., `input`) are renamed with an `_x` suffix (e.g., `--input_x`).
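
The renaming rule for reserved names can be pictured with a short sketch. The `RESERVED` set and the `field_to_flag` helper are assumptions made for illustration, built from the option tables above; the list LEX actually uses may differ.

```python
# Hypothetical sketch: map a log field name to the command-line flag
# that would select it. RESERVED is an assumed list taken from the
# long option names in the tables above, not from the LEX source.
RESERVED = {
    "input", "output", "log-type", "agg-type",
    "profile-file", "profile", "no-label", "help",
}


def field_to_flag(field: str) -> str:
    """Return the flag for a field, adding _x when the name is reserved."""
    name = f"{field}_x" if field in RESERVED else field
    return f"--{name}"


if __name__ == "__main__":
    for field in ("ts", "input", "protoPayload.methodName"):
        print(field, "->", field_to_flag(field))
```

Because field flags are case-sensitive, a lookup like this should be applied to the exact names reported by `-p`.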
A JSON file customizes suspicion criteria. Without a profile, defaults apply unless `--no-label` is used.
{
  "zeek": {
    "malicious_labels": ["malicious", "attack"],
    "duration_threshold": 3600.0
  },
  "gcp": {
    "suspicious_methods": ["google.iam.admin.v1.CreateServiceAccount"],
    "non_org_domains": ["@external.com"],
    "bytes_threshold": 1000000
  },
  "netflow": {
    "bytes_threshold": 10000000,
    "packets_threshold": 5000,
    "port_count_threshold": 10,
    "suspicious_ports": [80, 443],
    "rules": [{"field": "doctets", "op": ">", "value": 10000000}]
  }
}
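
The `rules` entry in the NetFlow section pairs a field with a comparison operator and a value. A minimal sketch of how such a rule might be evaluated follows; the operator table and the `rule_matches` helper are assumptions, since the example profile only shows `">"`.

```python
import operator

# Assumed operator set; only ">" appears in the example profile.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def rule_matches(rule: dict, record: dict) -> bool:
    """Return True when record[rule['field']] <op> rule['value'] holds."""
    raw = record.get(rule["field"])
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return False  # missing or non-numeric fields never match
    return OPS[rule["op"]](value, float(rule["value"]))


if __name__ == "__main__":
    rule = {"field": "doctets", "op": ">", "value": 10000000}
    print(rule_matches(rule, {"doctets": "15000000"}))
    print(rule_matches(rule, {"doctets": "5000"}))
```

Values are compared as floats here because CSV fields arrive as strings; whether LEX compares numerically or as text is not stated in this page.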
- Zeek defaults: `malicious_labels: {"malicious", "attack", "exploit"}`, `duration_threshold: 3600.0`.
- GCP defaults: `suspicious_methods: {"google.iam.admin.v1.CreateServiceAccount", "google.cloud.storage.v1.GetObject"}`, `non_org_domains: {"@external.com"}`, `bytes_threshold: 1000000`.
- NetFlow defaults: `bytes_threshold: 10000000`, `packets_threshold: 5000`, `port_count_threshold: 10`, `suspicious_ports: {}`.
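
One way to picture how a custom profile could interact with these defaults is a per-key overlay. This is a sketch under an assumption: this page does not say whether LEX merges individual keys or replaces a whole section, so `load_profile` and its merge behavior are hypothetical.

```python
import copy
import json
from typing import Optional

# Default values as listed above.
DEFAULTS = {
    "zeek": {
        "malicious_labels": ["malicious", "attack", "exploit"],
        "duration_threshold": 3600.0,
    },
    "gcp": {
        "suspicious_methods": [
            "google.iam.admin.v1.CreateServiceAccount",
            "google.cloud.storage.v1.GetObject",
        ],
        "non_org_domains": ["@external.com"],
        "bytes_threshold": 1000000,
    },
    "netflow": {
        "bytes_threshold": 10000000,
        "packets_threshold": 5000,
        "port_count_threshold": 10,
        "suspicious_ports": [],
    },
}


def load_profile(profile_json: Optional[str]) -> dict:
    """Overlay a custom profile on the defaults, key by key (assumed)."""
    profile = copy.deepcopy(DEFAULTS)
    if profile_json:
        for log_type, overrides in json.loads(profile_json).items():
            profile.setdefault(log_type, {}).update(overrides)
    return profile


if __name__ == "__main__":
    custom = '{"zeek": {"duration_threshold": 60}}'
    print(load_profile(custom)["zeek"])
```

With this overlay, a profile that sets only `duration_threshold` would keep the default `malicious_labels`.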
- Zeek: Label `1` if `label` in `malicious_labels` or `duration > duration_threshold`.
- GCP: Label `1` if `protoPayload.methodName` in `suspicious_methods`, `authenticationInfo.principalEmail` in `non_org_domains` or empty, or `protoPayload.requestSize > bytes_threshold`.
- NetFlow: Label `1` if `total_bytes > bytes_threshold`, `total_packets > packets_threshold`, `dst_ports.size() > port_count_threshold`, `dstport` in `suspicious_ports`, or a `rule` matches.
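
The three rules above can be sketched as small predicates. Several details are assumptions rather than documented behavior: exact string matching for Zeek labels, suffix matching of the email against `non_org_domains`, treating unparsable numbers as zero, and the helper names themselves. Custom NetFlow `rules` are left out to keep the sketch short.

```python
# Profile values copied from the defaults listed above.
ZEEK = {"malicious_labels": {"malicious", "attack", "exploit"},
        "duration_threshold": 3600.0}
GCP = {
    "suspicious_methods": {"google.iam.admin.v1.CreateServiceAccount",
                           "google.cloud.storage.v1.GetObject"},
    "non_org_domains": {"@external.com"},
    "bytes_threshold": 1000000,
}
NETFLOW = {"bytes_threshold": 10000000, "packets_threshold": 5000,
           "port_count_threshold": 10, "suspicious_ports": set()}


def to_float(value, default=0.0):
    """Parse a numeric field; unset values such as '-' fall back to default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def label_zeek(row, p=ZEEK):
    if row.get("label", "") in p["malicious_labels"]:  # exact match (assumed)
        return 1
    return int(to_float(row.get("duration")) > p["duration_threshold"])


def label_gcp(row, p=GCP):
    email = row.get("authenticationInfo.principalEmail", "")
    if row.get("protoPayload.methodName", "") in p["suspicious_methods"]:
        return 1
    # Suffix match against each domain is an assumption.
    if not email or any(email.endswith(d) for d in p["non_org_domains"]):
        return 1
    size = to_float(row.get("protoPayload.requestSize"))
    return int(size > p["bytes_threshold"])


def label_netflow(stats, p=NETFLOW):
    """stats holds per-key aggregates: total_bytes, total_packets, dst_ports."""
    return int(
        stats["total_bytes"] > p["bytes_threshold"]
        or stats["total_packets"] > p["packets_threshold"]
        or len(stats["dst_ports"]) > p["port_count_threshold"]
        or stats.get("dstport") in p["suspicious_ports"]
    )


if __name__ == "__main__":
    print(label_zeek({"label": "attack", "duration": "1.0"}))
    print(label_gcp({"protoPayload.methodName": "google.cloud.storage.v1.ListBuckets",
                     "authenticationInfo.principalEmail": "dev@corp.example"}))
```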
- Structure: CSV with headers matching selected fields, plus `label` (unless `--no-label`).
- Naming:
  - Directory output: `<parent>_<stem>_features.csv` (e.g., `test_zeek_conn_features.csv`).
  - Single file: Specified CSV path (e.g., `output.csv`).
- Example (Zeek with `--no-label`):

      ts,duration
      1.0,5.0
      2.0,2.0

- Example (GCP with labeling):

      protoPayload.methodName,label
      google.iam.admin.v1.CreateServiceAccount,1
      google.cloud.storage.v1.ListBuckets,0
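
The directory naming pattern can be illustrated with `pathlib`. This sketch assumes `<parent>` is the name of the input file's immediate parent directory and `<stem>` is the file name without its last extension; `features_path` is a hypothetical helper, not a LEX function.

```python
from pathlib import Path


def features_path(input_file: str, output_dir: str) -> Path:
    """Build <output_dir>/<parent>_<stem>_features.csv for one input file."""
    src = Path(input_file)
    return Path(output_dir) / f"{src.parent.name}_{src.stem}_features.csv"


if __name__ == "__main__":
    print(features_path("test_zeek/conn.log", "out").name)
```

Under these assumptions, `test_zeek/conn.log` maps to `test_zeek_conn_features.csv`, which matches the example name given above.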
LEX -i logs/zeek -t zeek -p
Output:
Available fields:
--ts
--duration
--label
...
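
For JSON-based logs such as GCP `.jsonl`, dotted names like `protoPayload.methodName` suggest that nested keys are flattened into field names. The sketch below shows one way fields could be discovered from such input. It is an assumption about the general technique, not a description of how LEX derives its names, and `flatten_keys` and `profile_jsonl` are hypothetical helpers.

```python
import json


def flatten_keys(obj: dict, prefix: str = ""):
    """Yield dotted key paths for every non-dict leaf in a nested dict."""
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_keys(value, name)
        else:
            yield name


def profile_jsonl(lines):
    """Return the distinct field names seen across JSON lines, in first-seen order."""
    fields, seen = [], set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for name in flatten_keys(json.loads(line)):
            if name not in seen:
                seen.add(name)
                fields.append(name)
    return fields


if __name__ == "__main__":
    sample = [
        '{"protoPayload": {"methodName": "x", "requestSize": 10}, "timestamp": "t"}',
        '{"protoPayload": {"methodName": "y"}, "severity": "INFO"}',
    ]
    for field in profile_jsonl(sample):
        print(f"--{field}")
```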
LEX -i logs/zeek -t zeek -o zeek.csv --no-label --ts --duration
Output: zeek.csv
ts,duration
1.0,5.0
2.0,2.0
LEX -i logs/gcp -t gcp -o gcp.csv --protoPayload.methodName --authenticationInfo.principalEmail
Output: gcp.csv
protoPayload.methodName,authenticationInfo.principalEmail,label
google.iam.admin.v1.CreateServiceAccount,[email protected],1
google.cloud.storage.v1.ListBuckets,[email protected],0
LEX -i logs/netflow -t netflow -o netflow -a port -f profile.json --srcaddr --dstport --doctets
Output: netflow/flow1_features.csv
srcaddr,dstport,doctets,label
192.168.1.1,80,15000000,1
10.0.0.1,443,5000,0
LEX fits into the SPIF pipeline:
- Extract: Use `TEX` for packets or LEX for logs.
- Preprocess: Convert CSVs to `.npy` with `zeek_preprocessor`.
- Train: Feed `.npy` files to `Train_XGB.py` for model training.
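
To show what a downstream step receives, the sketch below splits a LEX-style CSV into feature rows and a label vector using only the standard library. It is not the `zeek_preprocessor` implementation; `split_features_labels` is a hypothetical helper, and the sample text reuses the NetFlow output shown earlier.

```python
import csv
import io


def split_features_labels(csv_text: str):
    """Return (feature_names, rows, labels) from CSV text with an optional label column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [n for n in (reader.fieldnames or []) if n != "label"]
    rows, labels = [], []
    for row in reader:
        rows.append([row[n] for n in feature_names])
        if "label" in row:  # absent when the CSV was written with --no-label
            labels.append(int(row["label"]))
    return feature_names, rows, labels


if __name__ == "__main__":
    text = (
        "srcaddr,dstport,doctets,label\n"
        "192.168.1.1,80,15000000,1\n"
        "10.0.0.1,443,5000,0\n"
    )
    names, rows, labels = split_features_labels(text)
    print(names, rows, labels)
```

Feature values stay as strings here; encoding addresses and converting numbers is left to the preprocessing stage.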
- Field Names: Case-sensitive; use `-p` to verify exact names.
- Aggregation: NetFlow stats are per-file; global aggregation isn’t supported yet.
- Performance: Parallel processing scales with file count and CPU cores.
- Future Features: Log tailing and alternative output streams (e.g., stdout) are planned.
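
Per-file NetFlow aggregation by pair, source, or port can be sketched as a group-by over the rows of one file. The grouping keys and the `dstaddr` and `dpkts` column names are assumptions based on common NetFlow CSV exports; only `srcaddr`, `dstport`, and `doctets` appear on this page, and `aggregate` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed mapping from --agg-type values to grouping columns.
KEY_FIELDS = {
    "pair": ("srcaddr", "dstaddr"),
    "source": ("srcaddr",),
    "port": ("dstport",),
}


def aggregate(rows, agg_type="pair"):
    """Sum bytes and packets and collect destination ports per group key."""
    key_fields = KEY_FIELDS[agg_type]
    stats = defaultdict(
        lambda: {"total_bytes": 0, "total_packets": 0, "dst_ports": set()}
    )
    for row in rows:
        entry = stats[tuple(row[f] for f in key_fields)]
        entry["total_bytes"] += int(row["doctets"])
        entry["total_packets"] += int(row["dpkts"])
        entry["dst_ports"].add(row["dstport"])
    return dict(stats)


if __name__ == "__main__":
    flows = [
        {"srcaddr": "10.0.0.1", "dstaddr": "10.0.0.2", "dstport": "80",
         "doctets": "100", "dpkts": "2"},
        {"srcaddr": "10.0.0.1", "dstaddr": "10.0.0.3", "dstport": "443",
         "doctets": "50", "dpkts": "1"},
    ]
    print(aggregate(flows, "source"))
```

Calling this once per input file mirrors the per-file scope noted above; statistics from different files are never combined.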
- "No fields found": Check input path and log type compatibility (e.g., `.log` for Zeek).
- "Could not open output": Ensure directory exists or use a full CSV path.
- Unrecognized option: Verify field names with `-p`; options are case-sensitive.
Submit issues or PRs at [SPIF GitHub TBD] to add features or improve docs.