LEX (Log EXtractor) Documentation - opelbn/SPIF GitHub Wiki

Log EXtractor (LEX)

LEX converts Zeek, GCP, and NetFlow logs into structured CSV data for analysis or machine learning. It supports selective field extraction and suspicion labeling, and processes input files in parallel. As part of the Simplified Procedural Inference Framework (SPIF), it connects raw log data to downstream preprocessing and model training, and is aimed at network security analysts, IoT researchers, and cloud auditors.

Features

  • Supported Log Types: Processes Zeek (.log, .labeled), GCP (.jsonl), and NetFlow (.csv) logs.
  • Field Profiling: Discovers available fields dynamically from input logs.
  • Field Extraction: Extracts user-specified fields into CSV, defaulting to all if none are selected.
  • Suspicion Labeling: Tags entries as suspicious (1) or normal (0) using default or custom JSON profiles.
  • No-Label Mode: Extracts fields without applying suspicion labels (--no-label).
  • Parallel Processing: Speeds up large datasets with multi-threaded file handling.
  • NetFlow Aggregation: Aggregates stats by pair, source, or port within each file.

Installation

Prerequisites

  • C++17 Compiler: GCC, Clang, or MSVC (with threading support).
  • Dependencies: the cxxopts.hpp and json.hpp headers (see Build).
  • SPIF Context: Part of the SPIF repo; see SPIF Installation for full setup.

Build

  1. Place cxxopts.hpp and json.hpp in the project directory.
  2. Compile:
    g++ -std=c++17 -I. main.cpp logslice.cpp -o LEX -pthread
  3. Output: LEX (Linux) or LEX.exe (Windows).

Usage

LEX [options]

Options

Input/Output

| Option | Description | Default | Example |
|---|---|---|---|
| -i, --input | Input directory or file (required) | N/A | -i logs/zeek |
| -o, --output | Output directory or CSV file | N/A | -o output.csv |

Configuration

| Option | Description | Default | Example |
|---|---|---|---|
| -t, --log-type | Log type (zeek, gcp, netflow) | zeek | -t gcp |
| -a, --agg-type | NetFlow aggregation (pair, source, port) | pair | -a port |
| -f, --profile-file | Suspicion profile JSON file | None | -f profile.json |
| -p, --profile | Profile fields only, no processing | False | -p |
| -n, --no-label | Extract fields without labeling | False | -n |
| -h, --help | Display help message | N/A | -h |

Field Selection

| Option | Description | Default | Example |
|---|---|---|---|
| --<field> | Extract specific field (case-sensitive) | All fields | --ts --duration |
  • Note: Available fields depend on the log type and input data; use -p to list them (e.g., ts, protoPayload.methodName, srcaddr). A field whose name collides with a reserved option (e.g., input) is exposed with an _x suffix (e.g., --input_x).
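
The renaming rule in the note can be illustrated with a short Python sketch. The reserved set below is inferred from the long option names in the tables above and is an assumption; LEX's internal list may differ, and `field_to_option` is a hypothetical helper, not part of LEX.

```python
# Long option names from the option tables above (assumed to be the reserved set).
RESERVED = {"input", "output", "log-type", "agg-type",
            "profile-file", "profile", "no-label", "help"}

def field_to_option(field: str) -> str:
    """Return the command-line flag for a log field, adding _x on a name clash."""
    name = f"{field}_x" if field in RESERVED else field
    return f"--{name}"

print(field_to_option("ts"))     # --ts
print(field_to_option("input"))  # --input_x
```

Field names are passed through unchanged otherwise, which is consistent with the flags being case-sensitive.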

Suspicion Profile

A JSON file customizes suspicion criteria. Without a profile, defaults apply unless --no-label is used.

Structure

{
    "zeek": {
        "malicious_labels": ["malicious", "attack"],
        "duration_threshold": 3600.0
    },
    "gcp": {
        "suspicious_methods": ["google.iam.admin.v1.CreateServiceAccount"],
        "non_org_domains": ["@external.com"],
        "bytes_threshold": 1000000
    },
    "netflow": {
        "bytes_threshold": 10000000,
        "packets_threshold": 5000,
        "port_count_threshold": 10,
        "suspicious_ports": [80, 443],
        "rules": [{"field": "doctets", "op": ">", "value": 10000000}]
    }
}
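
A profile with this structure can also be generated from a script, for example when thresholds are computed elsewhere. The following is a minimal sketch using Python's standard json module; the key names come from the structure above, while `write_profile` and the temporary output location are illustrative choices.

```python
import json
import tempfile
from pathlib import Path

ALLOWED_SECTIONS = {"zeek", "gcp", "netflow"}

def write_profile(path: Path, profile: dict) -> None:
    """Write a suspicion profile as indented JSON, rejecting unknown sections."""
    unknown = set(profile) - ALLOWED_SECTIONS
    if unknown:
        raise ValueError(f"unknown log types in profile: {sorted(unknown)}")
    path.write_text(json.dumps(profile, indent=4), encoding="utf-8")

profile = {
    "zeek": {"malicious_labels": ["malicious", "attack"],
             "duration_threshold": 3600.0},
    "gcp": {"suspicious_methods": ["google.iam.admin.v1.CreateServiceAccount"],
            "non_org_domains": ["@external.com"],
            "bytes_threshold": 1000000},
    "netflow": {"bytes_threshold": 10000000,
                "packets_threshold": 5000,
                "port_count_threshold": 10,
                "suspicious_ports": [80, 443],
                "rules": [{"field": "doctets", "op": ">", "value": 10000000}]},
}

out = Path(tempfile.mkdtemp()) / "profile.json"
write_profile(out, profile)
print(out.read_text(encoding="utf-8"))
```

The resulting file would then be passed to LEX with -f.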

Defaults

  • Zeek: malicious_labels: {"malicious", "attack", "exploit"}, duration_threshold: 3600.0.
  • GCP: suspicious_methods: {"google.iam.admin.v1.CreateServiceAccount", "google.cloud.storage.v1.GetObject"}, non_org_domains: {"@external.com"}, bytes_threshold: 1000000.
  • NetFlow: bytes_threshold: 10000000, packets_threshold: 5000, port_count_threshold: 10, suspicious_ports: {}.

Criteria

  • Zeek: Label 1 if label in malicious_labels or duration > duration_threshold.
  • GCP: Label 1 if protoPayload.methodName is in suspicious_methods, authenticationInfo.principalEmail is empty or matches an entry in non_org_domains, or protoPayload.requestSize > bytes_threshold.
  • NetFlow: Label 1 if total_bytes > bytes_threshold, total_packets > packets_threshold, dst_ports.size() > port_count_threshold, dstport in suspicious_ports, or a rule matches.
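
The criteria above can be read as the following Python sketch. It is an interpretation of this section, not LEX's C++ implementation. Several details are assumptions: domain matching is done with a suffix check, only the ">" rule operator is documented (the others are guesses), unparseable numbers are treated as 0, and `stats` stands for the per-key aggregate of the current file.

```python
import operator

# Default profiles as listed under "Defaults".
ZEEK = {"malicious_labels": {"malicious", "attack", "exploit"},
        "duration_threshold": 3600.0}
GCP = {"suspicious_methods": {"google.iam.admin.v1.CreateServiceAccount",
                              "google.cloud.storage.v1.GetObject"},
       "non_org_domains": {"@external.com"},
       "bytes_threshold": 1000000}
NETFLOW = {"bytes_threshold": 10000000, "packets_threshold": 5000,
           "port_count_threshold": 10, "suspicious_ports": set(), "rules": []}

# Only ">" appears in the documented profile; the other operators are assumed.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

def num(value, default=0.0):
    """Parse a numeric field, falling back to default for missing or '-' values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

def label_zeek(row, p=ZEEK):
    return int(row.get("label", "") in p["malicious_labels"]
               or num(row.get("duration")) > p["duration_threshold"])

def label_gcp(row, p=GCP):
    email = row.get("authenticationInfo.principalEmail", "")
    return int(row.get("protoPayload.methodName") in p["suspicious_methods"]
               or not email
               or any(email.endswith(d) for d in p["non_org_domains"])  # assumed suffix match
               or num(row.get("protoPayload.requestSize")) > p["bytes_threshold"])

def label_netflow(row, stats, p=NETFLOW):
    rule_hit = any(OPS[r["op"]](num(row.get(r["field"])), r["value"])
                   for r in p.get("rules", []))
    return int(stats["total_bytes"] > p["bytes_threshold"]
               or stats["total_packets"] > p["packets_threshold"]
               or len(stats["dst_ports"]) > p["port_count_threshold"]
               or int(num(row.get("dstport"), -1)) in p["suspicious_ports"]
               or rule_hit)

print(label_zeek({"label": "attack", "duration": "5.0"}))  # 1
print(label_gcp({"protoPayload.methodName": "google.cloud.storage.v1.ListBuckets",
                 "authenticationInfo.principalEmail": "alice@example.org"}))  # 0
```

Each function returns 1 as soon as any single condition holds, which matches the "or" wording of the criteria.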

Output Format

  • Structure: CSV with headers matching selected fields, plus label (unless --no-label).
  • Naming:
    • Directory output: <parent>_<stem>_features.csv (e.g., test_zeek_conn_features.csv).
    • Single file: Specified CSV path (e.g., output.csv).
  • Example (Zeek with --no-label):
    ts,duration
    1.0,5.0
    2.0,2.0
    
  • Example (GCP with labeling):
    protoPayload.methodName,label
    google.iam.admin.v1.CreateServiceAccount,1
    google.cloud.storage.v1.ListBuckets,0
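
The directory-output naming pattern can be sketched with pathlib. This assumes that <parent> is the name of the directory containing the log and <stem> is the file name without its last extension; how LEX treats multi-suffix names such as .log.labeled is not specified here, and `feature_csv_name` is a hypothetical helper.

```python
from pathlib import Path

def feature_csv_name(log_path: str) -> str:
    """Build <parent>_<stem>_features.csv from an input log path."""
    p = Path(log_path)
    return f"{p.parent.name}_{p.stem}_features.csv"

print(feature_csv_name("test_zeek/conn.log"))  # test_zeek_conn_features.csv
```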
    

Examples

1. Profile Available Fields

LEX -i logs/zeek -t zeek -p

Output:

Available fields:
  --ts
  --duration
  --label
  ...
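
For intuition about where these field names come from, the sketch below discovers fields in two of the supported formats. It is not LEX's code. It assumes Zeek logs carry the usual tab-separated `#fields` header line and that nested GCP JSON keys are joined with dots, which is consistent with names like protoPayload.methodName; the helper names are hypothetical.

```python
import json

def zeek_fields(lines):
    """Return field names from the '#fields' header of a Zeek TSV log."""
    for line in lines:
        if line.startswith("#fields"):
            return line.rstrip("\n").split("\t")[1:]
    return []

def flatten_keys(obj, prefix=""):
    """Yield dotted key paths for every leaf of a nested JSON object."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_keys(value, path)
        else:
            yield path

def gcp_fields(lines):
    """Collect the sorted union of dotted key paths across JSONL records."""
    found = set()
    for line in lines:
        if line.strip():
            found.update(flatten_keys(json.loads(line)))
    return sorted(found)

zeek_sample = ["#separator \\x09\n", "#fields\tts\tuid\tduration\n", "1.0\tC1\t5.0\n"]
print(zeek_fields(zeek_sample))  # ['ts', 'uid', 'duration']

gcp_sample = ['{"protoPayload": {"methodName": "x", "requestSize": 10}, "timestamp": "t"}']
print(gcp_fields(gcp_sample))
# ['protoPayload.methodName', 'protoPayload.requestSize', 'timestamp']
```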

2. Basic Extraction (Zeek)

LEX -i logs/zeek -t zeek -o zeek.csv -n --ts --duration

Output: zeek.csv

ts,duration
1.0,5.0
2.0,2.0

3. Extraction with Labeling (GCP)

LEX -i logs/gcp -t gcp -o gcp.csv --protoPayload.methodName --authenticationInfo.principalEmail

Output: gcp.csv

protoPayload.methodName,authenticationInfo.principalEmail,label
google.iam.admin.v1.CreateServiceAccount,[email protected],1
google.cloud.storage.v1.ListBuckets,[email protected],0

4. NetFlow with Aggregation and Custom Profile

LEX -i logs/netflow -t netflow -o netflow -a port -f profile.json --srcaddr --dstport --doctets

Output: netflow/netflow_flow1_features.csv (for an input file logs/netflow/flow1.csv)

srcaddr,dstport,doctets,label
192.168.1.1,80,15000000,1
10.0.0.1,443,5000,0
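
The per-file aggregation selected with -a can be pictured as a group-by over flow records. The sketch below is an assumption about what each mode keys on: pair as (srcaddr, dstaddr), source as srcaddr, and port as dstport. The dstaddr and dpkts field names follow common NetFlow v5 naming and do not appear elsewhere on this page.

```python
from collections import defaultdict

# Assumed grouping key for each aggregation type.
KEYS = {
    "pair": lambda r: (r["srcaddr"], r["dstaddr"]),
    "source": lambda r: (r["srcaddr"],),
    "port": lambda r: (r["dstport"],),
}

def aggregate(rows, agg_type="pair"):
    """Sum bytes and packets and collect destination ports per grouping key."""
    key_of = KEYS[agg_type]
    stats = defaultdict(lambda: {"total_bytes": 0, "total_packets": 0,
                                 "dst_ports": set()})
    for r in rows:
        s = stats[key_of(r)]
        s["total_bytes"] += int(r["doctets"])
        s["total_packets"] += int(r["dpkts"])
        s["dst_ports"].add(int(r["dstport"]))
    return dict(stats)

rows = [
    {"srcaddr": "192.168.1.1", "dstaddr": "10.0.0.5", "dstport": "80",
     "doctets": "15000000", "dpkts": "12000"},
    {"srcaddr": "192.168.1.1", "dstaddr": "10.0.0.5", "dstport": "443",
     "doctets": "5000", "dpkts": "4"},
    {"srcaddr": "10.0.0.1", "dstaddr": "10.0.0.5", "dstport": "443",
     "doctets": "5000", "dpkts": "4"},
]
by_source = aggregate(rows, "source")
print(by_source[("192.168.1.1",)]["total_bytes"])  # 15005000
```

Totals of this kind are what the NetFlow thresholds (bytes_threshold, packets_threshold, port_count_threshold) would be compared against.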

SPIF Integration

LEX fits into the SPIF pipeline:

  1. Extract: Use TEX for packets or LEX for logs.
  2. Preprocess: Convert CSVs to .npy with zeek_preprocessor.
  3. Train: Feed .npy files to Train_XGB.py for model training.
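
Before handing a LEX CSV to the preprocessing step, it can be useful to check its columns and label balance. The sketch below uses only Python's standard csv module and is not part of zeek_preprocessor; it assumes the file was produced with labeling enabled, so a label column is present.

```python
import csv
import io

def split_features_labels(csv_text: str):
    """Return (feature_names, feature_rows, labels) from LEX-style CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features = [c for c in reader.fieldnames if c != "label"]
    rows, labels = [], []
    for rec in reader:
        rows.append([rec[c] for c in features])
        labels.append(int(rec["label"]))
    return features, rows, labels

sample = ("srcaddr,dstport,doctets,label\n"
          "192.168.1.1,80,15000000,1\n"
          "10.0.0.1,443,5000,0\n")
features, rows, labels = split_features_labels(sample)
print(features)                  # ['srcaddr', 'dstport', 'doctets']
print(sum(labels), len(labels))  # 1 2
```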

Notes

  • Field Names: Case-sensitive; use -p to verify exact names.
  • Aggregation: NetFlow stats are per-file; global aggregation isn’t supported yet.
  • Performance: Parallel processing scales with file count and CPU cores.
  • Future Features: Log tailing and alternative output streams (e.g., stdout) are planned.
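
The per-file parallelism mentioned above follows a common pattern: one independent task per input file, spread over a worker pool. The Python sketch below only illustrates that pattern. LEX itself is written in C++, and `process_file` is a placeholder task that counts data lines rather than extracting fields.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    """Placeholder per-file task: count lines that are not '#' comments."""
    with open(path, encoding="utf-8") as fh:
        return path, sum(1 for line in fh if not line.startswith("#"))

def process_all(paths, workers=None):
    """Run process_file over all paths with one task per file."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_file, paths))
```

Because files are independent units of work, throughput in this pattern grows with the number of files up to the number of workers, which is in line with the performance note above.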

Troubleshooting

  • "No fields found": Check input path and log type compatibility (e.g., .log for Zeek).
  • "Could not open output": Ensure the output directory exists, or pass a full CSV file path.
  • Unrecognized option: Verify field names with -p; options are case-sensitive.

Contributing

Submit issues or PRs at [SPIF GitHub TBD] to add features or improve docs.

License

Apache License 2.0
