Log EXtractor (LEX)

LEX converts Zeek, GCP, and NetFlow logs into structured CSV data for analysis or machine learning. It supports flexible field extraction and optional suspicion labeling, and it processes input files in parallel. As part of the Simplified Procedural Inference Framework (SPIF), it bridges raw log data to downstream preprocessing and model training, and is aimed at network security analysts, IoT researchers, and cloud auditors.

Features

Supported Log Types: Processes Zeek (.log, .labeled), GCP (.jsonl), and NetFlow (.csv) logs.
Field Profiling: Discovers available fields dynamically from input logs.
Field Extraction: Extracts user-specified fields into CSV, defaulting to all if none are selected.
Suspicion Labeling: Tags entries as suspicious (1) or normal (0) using default or custom JSON profiles.
No-Label Mode: Extracts fields without applying suspicion labels (--no-label).
Parallel Processing: Speeds up large datasets with multi-threaded file handling.
NetFlow Aggregation: Aggregates stats by pair, source, or port within each file.
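
To make the three NetFlow aggregation modes concrete, here is a minimal Python sketch of per-file grouping. It is an illustration only, not LEX's C++ implementation. The column names `dstaddr` and `dpkts`, the choice of `dstport` as the key for `port` mode, and the function name `aggregate_flows` are assumptions; only `srcaddr`, `dstport`, and `doctets` appear elsewhere in this document.

```python
from collections import defaultdict


def aggregate_flows(rows, agg_type="pair"):
    """Group the flow records of one file and sum their byte/packet counts."""

    def key_for(row):
        if agg_type == "pair":      # one bucket per (source, destination) pair
            return (row["srcaddr"], row["dstaddr"])
        if agg_type == "source":    # one bucket per source address
            return row["srcaddr"]
        if agg_type == "port":      # one bucket per destination port (assumed)
            return row["dstport"]
        raise ValueError(f"unknown aggregation type: {agg_type}")

    stats = defaultdict(
        lambda: {"total_bytes": 0, "total_packets": 0, "dst_ports": set()}
    )
    for row in rows:
        entry = stats[key_for(row)]
        entry["total_bytes"] += int(row["doctets"])
        entry["total_packets"] += int(row["dpkts"])
        entry["dst_ports"].add(int(row["dstport"]))
    return dict(stats)


# Hypothetical records, shaped like rows read from a NetFlow CSV.
flows = [
    {"srcaddr": "192.168.1.1", "dstaddr": "10.0.0.5", "dstport": "80",
     "doctets": "1500", "dpkts": "3"},
    {"srcaddr": "192.168.1.1", "dstaddr": "10.0.0.5", "dstport": "443",
     "doctets": "500", "dpkts": "2"},
    {"srcaddr": "10.0.0.1", "dstaddr": "10.0.0.5", "dstport": "443",
     "doctets": "5000", "dpkts": "10"},
]
by_source = aggregate_flows(flows, "source")
print(by_source["192.168.1.1"]["total_bytes"])  # 2000
```

Because the grouping runs once per input file, the same source address appearing in two files would yield two separate aggregates, which matches the per-file limitation listed under Notes.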

Installation

Prerequisites

C++17 Compiler: GCC, Clang, or MSVC (with threading support).
Dependencies:
- cxxopts.hpp (GitHub).
- nlohmann/json.hpp (GitHub).
SPIF Context: Part of the SPIF repo; see SPIF Installation for full setup.

Build

Place cxxopts.hpp and json.hpp in the project directory.

Compile:

g++ -std=c++17 -I. main.cpp logslice.cpp -o LEX -pthread

Output: LEX (Linux) or LEX.exe (Windows).

Usage

LEX [options]

Options

Input/Output

| Option | Description | Default | Example |
| --- | --- | --- | --- |
| `-i, --input` | Input directory or file (required) | N/A | `-i logs/zeek` |
| `-o, --output` | Output directory or CSV file | N/A | `-o output.csv` |

Configuration

| Option | Description | Default | Example |
| --- | --- | --- | --- |
| `-t, --log-type` | Log type (`zeek`, `gcp`, `netflow`) | `zeek` | `-t gcp` |
| `-a, --agg-type` | NetFlow aggregation (`pair`, `source`, `port`) | `pair` | `-a port` |
| `-f, --profile-file` | Suspicion profile JSON file | None | `-f profile.json` |
| `-p, --profile` | Profile fields only, no processing | False | `-p` |
| `-n, --no-label` | Extract fields without labeling | False | `-n` |
| `-h, --help` | Display help message | N/A | `-h` |

Field Selection

| Option | Description | Default | Example |
| --- | --- | --- | --- |
| `--<field>` | Extract specific field (case-sensitive) | All fields | `--ts --duration` |

Note: Available fields depend on the log type and input data. Use -p to list them (e.g., ts, protoPayload.methodName, srcaddr). A field whose name collides with a reserved option (e.g., input) is exposed with an _x suffix (e.g., --input_x).
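
The renaming rule in the note above can be pictured with a short Python sketch. The set of reserved names is an assumption, taken from the long option names in the tables above, and `field_flag` is a hypothetical helper rather than part of LEX.

```python
# Assumed reserved names: the long options listed in the Options tables.
RESERVED = {
    "input", "output", "log-type", "agg-type",
    "profile-file", "profile", "no-label", "help",
}


def field_flag(field_name):
    """Return the command-line flag that would select a log field."""
    if field_name in RESERVED:
        # A colliding field gets the _x suffix so it cannot shadow an option.
        return f"--{field_name}_x"
    return f"--{field_name}"


print(field_flag("input"))                    # --input_x
print(field_flag("protoPayload.methodName"))  # --protoPayload.methodName
```

In practice, the output of -p remains the authoritative list of flag names for a given input.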

Suspicion Profile

A JSON profile file, passed with -f, customizes the suspicion criteria. If no profile is given, the built-in defaults apply. With --no-label, no labeling is performed at all.

Structure

{
    "zeek": {
        "malicious_labels": ["malicious", "attack"],
        "duration_threshold": 3600.0
    },
    "gcp": {
        "suspicious_methods": ["google.iam.admin.v1.CreateServiceAccount"],
        "non_org_domains": ["@external.com"],
        "bytes_threshold": 1000000
    },
    "netflow": {
        "bytes_threshold": 10000000,
        "packets_threshold": 5000,
        "port_count_threshold": 10,
        "suspicious_ports": [80, 443],
        "rules": [{"field": "doctets", "op": ">", "value": 10000000}]
    }
}
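
The `rules` entry in the NetFlow block is a list of field/operator/value comparisons. As a rough sketch of how such a rule could be evaluated against one record, consider the Python below. Only the `>` operator appears in this document; the other operators, the numeric comparison, the treatment of missing fields, and the name `rule_matches` are assumptions.

```python
import operator

# Only ">" is documented; the remaining operators are assumed for illustration.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def rule_matches(rule, record):
    """Return True if the record's field satisfies the rule's comparison."""
    value = record.get(rule["field"])
    if value is None:
        return False  # assumed: a missing field never matches
    compare = OPS[rule["op"]]
    return compare(float(value), float(rule["value"]))


rule = {"field": "doctets", "op": ">", "value": 10000000}
print(rule_matches(rule, {"doctets": "15000000"}))  # True
print(rule_matches(rule, {"doctets": "5000"}))      # False
```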

Defaults

Zeek: malicious_labels: {"malicious", "attack", "exploit"}, duration_threshold: 3600.0.
GCP: suspicious_methods: {"google.iam.admin.v1.CreateServiceAccount", "google.cloud.storage.v1.GetObject"}, non_org_domains: {"@external.com"}, bytes_threshold: 1000000.
NetFlow: bytes_threshold: 10000000, packets_threshold: 5000, port_count_threshold: 10, suspicious_ports: {}.

Criteria

Zeek: Label 1 if label in malicious_labels or duration > duration_threshold.
GCP: Label 1 if protoPayload.methodName is in suspicious_methods, authenticationInfo.principalEmail matches a domain in non_org_domains or is empty, or protoPayload.requestSize > bytes_threshold.
NetFlow: Label 1 if total_bytes > bytes_threshold, total_packets > packets_threshold, dst_ports.size() > port_count_threshold, dstport in suspicious_ports, or a rule matches.
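
The Zeek and GCP criteria above can be sketched in Python as follows. This is a reading of the documented rules under stated assumptions, not LEX's actual code. Matching an email against `non_org_domains` by substring, treating unparseable numbers as 0, and the function names are all assumptions.

```python
ZEEK_DEFAULTS = {
    "malicious_labels": {"malicious", "attack", "exploit"},
    "duration_threshold": 3600.0,
}
GCP_DEFAULTS = {
    "suspicious_methods": {
        "google.iam.admin.v1.CreateServiceAccount",
        "google.cloud.storage.v1.GetObject",
    },
    "non_org_domains": {"@external.com"},
    "bytes_threshold": 1000000,
}


def _to_float(value, default=0.0):
    """Parse a number, falling back to a default for unset values like '-'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def label_zeek(record, profile=ZEEK_DEFAULTS):
    """Return 1 for a malicious label or an over-threshold duration."""
    if record.get("label") in profile["malicious_labels"]:
        return 1
    if _to_float(record.get("duration")) > profile["duration_threshold"]:
        return 1
    return 0


def label_gcp(record, profile=GCP_DEFAULTS):
    """Return 1 if any of the three GCP criteria is met."""
    if record.get("protoPayload.methodName") in profile["suspicious_methods"]:
        return 1
    email = record.get("authenticationInfo.principalEmail", "")
    # Assumed: a domain "matches" when it occurs inside the email address.
    if not email or any(d in email for d in profile["non_org_domains"]):
        return 1
    size = _to_float(record.get("protoPayload.requestSize"))
    if size > profile["bytes_threshold"]:
        return 1
    return 0


print(label_zeek({"label": "benign", "duration": "5.0"}))   # 0
print(label_zeek({"label": "benign", "duration": "4000"}))  # 1
```

The criteria are combined with a logical OR, so a single matching condition is enough to label an entry as suspicious.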

Output Format

Structure: CSV with headers matching selected fields, plus label (unless --no-label).
Naming:
- Directory output: <parent>_<stem>_features.csv (e.g., test_zeek_conn_features.csv).
- Single file: Specified CSV path (e.g., output.csv).
Example (Zeek with --no-label):
```
ts,duration
1.0,5.0
2.0,2.0
```

Example (GCP with labeling):

protoPayload.methodName,label
google.iam.admin.v1.CreateServiceAccount,1
google.cloud.storage.v1.ListBuckets,0
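
Since the output is plain CSV with an optional trailing label column, a downstream script can separate features from labels with the standard library alone. The sketch below assumes the column is named exactly `label`, as in the examples above. The helper name `split_features_and_labels` is hypothetical.

```python
import csv
import io


def split_features_and_labels(csv_text):
    """Split LEX-style CSV text into feature rows and integer labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        # With --no-label there is no label column, so the label is None.
        label = row.pop("label", None)
        features.append(row)
        labels.append(None if label is None else int(label))
    return features, labels


sample = """protoPayload.methodName,label
google.iam.admin.v1.CreateServiceAccount,1
google.cloud.storage.v1.ListBuckets,0
"""
features, labels = split_features_and_labels(sample)
print(labels)  # [1, 0]
```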

Examples

1. Profile Available Fields

LEX -i logs/zeek -t zeek -p

Output:

Available fields:
  --ts
  --duration
  --label
  ...

2. Basic Extraction (Zeek)

LEX -i logs/zeek -t zeek -o zeek.csv --ts --duration

Output: zeek.csv

ts,duration
1.0,5.0
2.0,2.0

3. Extraction with Labeling (GCP)

LEX -i logs/gcp -t gcp -o gcp.csv --protoPayload.methodName --authenticationInfo.principalEmail

Output: gcp.csv

protoPayload.methodName,authenticationInfo.principalEmail,label
google.iam.admin.v1.CreateServiceAccount,[email protected],1
google.cloud.storage.v1.ListBuckets,[email protected],0

4. NetFlow with Aggregation and Custom Profile

LEX -i logs/netflow -t netflow -o netflow -a port -f profile.json --srcaddr --dstport --doctets

Output: netflow/flow1_features.csv

srcaddr,dstport,doctets,label
192.168.1.1,80,15000000,1
10.0.0.1,443,5000,0
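
When LEX is driven from a pipeline script instead of a shell, the command lines above can be assembled programmatically. The following Python sketch uses only the options documented in this page. The helper `build_lex_command` and its parameter names are hypothetical, and the executable is assumed to be reachable as `LEX`.

```python
import shlex


def build_lex_command(input_path, output_path, log_type="zeek", fields=(),
                      agg_type=None, profile_file=None, no_label=False,
                      executable="LEX"):
    """Build the argument list for a LEX run from the documented options."""
    cmd = [executable, "-i", input_path, "-t", log_type, "-o", output_path]
    if agg_type:
        cmd += ["-a", agg_type]       # NetFlow aggregation mode
    if profile_file:
        cmd += ["-f", profile_file]   # custom suspicion profile
    if no_label:
        cmd.append("-n")              # extract without labeling
    cmd += [f"--{name}" for name in fields]
    return cmd


cmd = build_lex_command(
    "logs/netflow", "netflow", log_type="netflow", agg_type="port",
    profile_file="profile.json", fields=["srcaddr", "dstport", "doctets"],
)
print(shlex.join(cmd))
# LEX -i logs/netflow -t netflow -o netflow -a port -f profile.json --srcaddr --dstport --doctets
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with field names that contain dots.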

SPIF Integration

LEX fits into the SPIF pipeline:

Extract: Use TEX for packets or LEX for logs.
Preprocess: Convert CSVs to .npy with zeek_preprocessor.
Train: Feed .npy files to Train_XGB.py for model training.

Notes

Field Names: Case-sensitive; use -p to verify exact names.
Aggregation: NetFlow stats are per-file; global aggregation isn’t supported yet.
Performance: Parallel processing scales with file count and CPU cores.
Future Features: Log tailing and alternative output streams (e.g., stdout) are planned.
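
The Performance note above describes file-level parallelism: each input file is an independent unit of work. A minimal Python sketch of that structure follows. It illustrates the one-task-per-file pattern only and does not reflect LEX's C++ threading code. The `count_lines` worker is a stand-in for real parsing, and all names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_lines(path):
    """Stand-in worker: a real worker would parse and extract fields."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def process_directory(directory, worker=count_lines, pattern="*.log",
                      max_workers=None):
    """Run one worker task per matching file and collect results by name."""
    files = sorted(Path(directory).glob(pattern))
    with ThreadPoolExecutor(max_workers=max_workers or os.cpu_count()) as pool:
        # map() preserves input order, so results line up with files.
        results = list(pool.map(worker, files))
    return dict(zip((f.name for f in files), results))
```

With this structure a directory containing a single large file gains nothing from extra cores, which is why throughput scales with file count as well as CPU count.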

Troubleshooting

"No fields found": Check input path and log type compatibility (e.g., .log for Zeek).
"Could not open output": Ensure the output directory exists, or pass a full CSV file path.
Unrecognized option: Verify field names with -p; options are case-sensitive.

Contributing

Submit issues or PRs at [SPIF GitHub TBD] to add features or improve docs.

License

Apache License 2.0

LEX (Log EXtractor) Documentation - opelbn/SPIF GitHub Wiki
