sfp_advanced_correlation - poppopjmp/spiderfoot GitHub Wiki

sfp_advanced_correlation - Advanced Data Correlation Engine

Overview

The Advanced Correlation Engine provides sophisticated data correlation and entity resolution capabilities. This module uses graph algorithms, temporal analysis, and machine learning techniques to identify relationships between entities across different data sources and platforms.

Features

Cross-Platform Identity Resolution

  • Username Correlation: Match usernames across social platforms
  • Email Address Linking: Connect email addresses to multiple accounts
  • Name Matching: Fuzzy matching for real names and aliases
  • Behavioral Patterns: Identify users by activity patterns

Temporal Pattern Analysis

  • Time-Series Correlation: Detect synchronized activities
  • Event Clustering: Group related events by time windows
  • Behavioral Rhythms: Identify user activity patterns
  • Anomaly Detection: Spot unusual temporal patterns

Geospatial Intelligence

  • Location Clustering: Group events by geographic proximity
  • Movement Patterns: Track entity movement over time
  • Distance Analysis: Calculate Haversine distances
  • Regional Correlations: Connect activities by location

Entity Relationship Mapping

  • Graph Construction: Build entity relationship graphs
  • Centrality Analysis: Identify key entities in networks
  • Community Detection: Find entity clusters and groups
  • Confidence Scoring: Quantify relationship strength

Configuration

Correlation Settings

[advanced_correlation]
# Enable correlation engine
correlation_enabled = True

# Confidence threshold for matches (0.0-1.0)
confidence_threshold = 0.7

# Time window for temporal correlation (hours)
temporal_window_hours = 24

# Maximum entities to track
max_entities = 10000

Geospatial Configuration

# Enable geospatial clustering
geospatial_enabled = True

# Clustering distance in kilometers
clustering_distance_km = 10.0

# Minimum cluster size
min_cluster_size = 3

Entity Resolution

# Enable entity resolution
entity_resolution_enabled = True

# String similarity threshold
similarity_threshold = 0.8

# Enable cross-platform correlation
cross_platform_correlation = True

Supported Event Types

Input Events

  • USERNAME
  • EMAILADDR
  • SOCIAL_MEDIA
  • GEOINFO
  • HUMAN_NAME
  • PHONE_NUMBER
  • IP_ADDRESS
  • DOMAIN_NAME

Output Events

  • CORRELATION_MATCH
  • ENTITY_RELATIONSHIP
  • PATTERN_DETECTION
  • TEMPORAL_ANOMALY
  • GEOSPATIAL_CLUSTER
  • IDENTITY_RESOLUTION

Correlation Algorithms

String Similarity Metrics

  • Levenshtein Distance: Character-level differences
  • Jaro-Winkler: Optimized for names and short strings
  • Soundex: Phonetic matching for name variations
  • N-gram Analysis: Token-based similarity

Graph Algorithms

  • Centrality Measures: Betweenness, closeness, eigenvector
  • Community Detection: Louvain algorithm implementation
  • Path Analysis: Shortest paths between entities
  • Clustering Coefficient: Network density analysis

Temporal Analysis

  • Time Windows: Configurable correlation periods
  • Event Frequency: Activity pattern recognition
  • Rhythm Analysis: Periodic behavior detection
  • Anomaly Scoring: Statistical deviation detection

Usage Examples

Cross-Platform Identity Investigation

python sf.py -s [email protected] -t EMAILADDR -m sfp_social,sfp_twitter,sfp_tiktok_osint,sfp_advanced_correlation

Temporal Pattern Analysis

python sf.py -s target_domain.com -t DOMAIN_NAME -m sfp_dnsresolve,sfp_ssl,sfp_advanced_correlation

Geospatial Correlation

python sf.py -s 192.168.1.0/24 -t NETBLOCK -m sfp_geoip,sfp_portscan_tcp,sfp_advanced_correlation

Correlation Rules Integration

The module works with YAML-based correlation rules:

TikTok User Correlation

name: "Cross-Platform TikTok Identity Correlation"
description: "Correlates TikTok users with other social platforms"
triggers:
  - event_type: "TIKTOK_PROFILE"
    conditions:
      - field: "username"
        operator: "matches_pattern"
        value: "social_username_pattern"

Behavioral Pattern Detection

name: "Suspicious Activity Pattern"
description: "Detects coordinated suspicious activities"
triggers:
  - event_type: "CORRELATION_MATCH"
    conditions:
      - field: "confidence_score"
        operator: "greater_than"
        value: 0.8

Performance Optimization

Caching Strategy

  • Entity Cache: In-memory entity storage with TTL
  • Relationship Cache: Pre-computed relationship graphs
  • Pattern Cache: Cached pattern recognition results
  • Query Optimization: Efficient database queries

Scalability Features

  • Batch Processing: Process multiple entities simultaneously
  • Incremental Updates: Update graphs without full rebuilding
  • Memory Management: Automatic cleanup of old entities
  • Parallel Processing: Multi-threaded correlation analysis

Integration with Other Modules

Recommended Combinations

# Social media investigation with correlation
-m sfp_twitter,sfp_tiktok_osint,sfp_linkedin,sfp_advanced_correlation

# Blockchain analysis with entity correlation
-m sfp_blockchain_analytics,sfp_advanced_correlation

# Email investigation with pattern detection
-m sfp_hunter,sfp_haveibeen,sfp_emailrep,sfp_advanced_correlation

Visualization and Reporting

Graph Visualization

  • Entity Graphs: Visual representation of relationships
  • Cluster Maps: Geographic and logical clustering
  • Timeline Views: Temporal correlation visualization
  • Confidence Heatmaps: Relationship strength indicators

Correlation Reports

  • Entity Profiles: Comprehensive entity summaries
  • Relationship Analysis: Connection strength assessment
  • Pattern Summaries: Detected behavioral patterns
  • Anomaly Reports: Unusual activity identification

Advanced Features

Machine Learning Integration

  • Pattern Recognition: ML-based pattern detection
  • Similarity Learning: Improved string matching
  • Anomaly Detection: Statistical and ML-based detection
  • Predictive Analysis: Relationship prediction

Custom Correlation Rules

  • Rule Engine: YAML-based rule definitions
  • Custom Triggers: User-defined correlation conditions
  • Weighted Scoring: Configurable confidence calculations
  • Rule Validation: Syntax and logic verification

Security and Privacy

Data Protection

  • Anonymization: Optional PII anonymization
  • Retention Policies: Configurable data cleanup
  • Access Controls: Restricted correlation data access
  • Audit Logging: Comprehensive activity logging

Compliance

  • GDPR Compliance: Privacy-by-design implementation
  • Data Minimization: Only necessary data correlation
  • Consent Management: User consent tracking
  • Right to Erasure: Data deletion capabilities

Troubleshooting

Common Issues

  1. High Memory Usage: Reduce max_entities or enable cleanup
  2. Slow Correlation: Adjust confidence thresholds
  3. False Positives: Fine-tune similarity thresholds
  4. Missing Correlations: Check input data quality

Performance Tuning

# Optimize for speed
confidence_threshold = 0.8
temporal_window_hours = 12
max_entities = 5000

# Optimize for accuracy
confidence_threshold = 0.6
temporal_window_hours = 48
max_entities = 20000

For more information on correlation techniques, see the Advanced Analytics Guide.

⚠️ **GitHub.com Fallback** ⚠️