correlation - poppopjmp/spiderfoot GitHub Wiki
This section documents the Correlations feature, its rule engine, and how to write and manage correlation rules in SpiderFoot. This content is based on the latest updates as of 2025.
- New Correlation Rules: Added rules for Fofa, RocketReach, and ZoomEye exposed services/contacts, and more.
- Advanced Analysis Methods: The engine now supports additional analysis methods such as `outlier`, `first_collection_only`, and `match_all_to_first_collection`.
- Improved Error Handling: If a rule contains syntax errors, SpiderFoot will now skip the invalid rule and continue loading others, providing detailed error messages at startup.
- Rule ID and Filename: The `id` field in each rule must exactly match the filename (excluding `.yaml`).
- Terminology Consistency: All references to rule components now use consistent terminology (e.g., "collections", "aggregation").
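As an illustration of where the new analysis methods fit, an `analysis` section applying the `outlier` method might look like the sketch below. The `maximum_percent` option name is an assumption for illustration; check the exact schema against `/correlations/README.md`:

```yaml
# Sketch only: report data elements whose value occurs in at most a small
# fraction of the collection. 'maximum_percent' is an assumed option name.
analysis:
  - method: outlier
    maximum_percent: 10
```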
SpiderFoot’s goal is to automate OSINT collection and analysis to the greatest extent possible. Since its inception, SpiderFoot has heavily focused on automating OSINT collection and entity extraction, but the automation of common analysis tasks -- beyond some reporting and visualisations -- has been left entirely to the user. This meant that the strength of SpiderFoot's data collection capabilities has sometimes been its weakness: with so much data collected, users have often needed to export it and use other tools to weed out the data of interest.
We started tackling this analysis gap with the launch of SpiderFoot in 2019 through the introduction of the "Correlations" feature. This feature was represented by some 30 "correlation rules" that ran with each scan, analyzing data and presenting results reflecting SpiderFoot's opinionated view on what may be important or interesting. Here are a few of those rules as examples:
- Hosts/IPs reported as malicious by multiple data sources
- Outlier web servers (can be an indication of shadow IT)
- Databases exposed on the Internet
- Open ports revealing software versions
- and many more.
With that said, let's get into what these rules look like and how to write one.
The rules themselves are written in YAML. Why YAML? It’s easy to read and write, allows for comments, and is increasingly commonplace in modern tools.
The simplest way to think of a SpiderFoot correlation rule is like a simple database query that consists of a few sections:
- Defining the rule itself (`id`, `version` and `meta` sections).
- Stating what you'd like to extract from the scan results (`collections` section).
- Grouping that data in some way (`aggregation` section; optional).
- Performing some analysis over that data (`analysis` section; optional).
- Presenting the results (`headline` section).
Here's an example rule that looks at SpiderFoot scan results for data revealing open TCP ports where the banner (the data returned upon connecting to the port) reports a software version. It does so by applying some regular expressions to the content of TCP_PORT_OPEN_BANNER data elements, filtering out some false positives and then grouping the results by the banner itself so that one correlation result is created per banner revealing a version:
```yaml
id: open_port_version
version: 1
meta:
  name: Open TCP port reveals version
  description: >
    A possible software version has been revealed on an open port. Such
    information may reveal the use of old/unpatched software used by
    the target.
  risk: INFO
collections:
  collect:
    - method: exact
      field: type
      value: TCP_PORT_OPEN_BANNER
    - method: regex
      field: data
      value: .*[0-9]\.[0-9].*
    - method: regex
      field: data
      value: not .*Mime-Version.*
    - method: regex
      field: data
      value: not .*HTTP/1.*
aggregation:
  field: data
headline: "Software version revealed on open port: {data}"
```

To show this in practice, we can run a simple scan against a target, in this case focusing on performing a port scan:
```
python3.9 ./sf.py -s www.binarypool.com -m sfp_dnsresolve,sfp_portscan_tcp
```

Example output:
```
2022-04-06 08:14:58,476 [INFO] sflib : Scan [94EB5F0B] for 'www.binarypool.com' initiated.
...
sfp_portscan_tcp    Open TCP Port Banner    SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.10
...
2022-04-06 08:15:23,110 [INFO] correlation : New correlation [open_port_version]: Software version revealed on open port: SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.10
2022-04-06 08:15:23,244 [INFO] sflib : Scan [94EB5F0B] completed.
```
We can see above that a port was found to be open by the sfp_portscan_tcp module, and it happens to include a version. The correlation rule open_port_version picked this up and reported it. This is also visible in the web interface.
NOTE: Rules will only succeed if relevant data exists in your scan results in the first place. In other words, correlation rules analyze scan data, they don't collect data from targets.
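To get a feel for how the `collect` filters in the example rule behave, here is a rough Python approximation (not SpiderFoot's actual implementation). In the rule format, a regex value prefixed with `not` acts as an exclusion:

```python
import re

# Patterns from the example rule: one inclusion and two 'not' exclusions.
include = re.compile(r".*[0-9]\.[0-9].*")
exclude = [re.compile(r".*Mime-Version.*"), re.compile(r".*HTTP/1.*")]

def matches_rule(data: str) -> bool:
    """Approximate the rule's regex collect methods on one data element."""
    return bool(include.match(data)) and not any(p.match(data) for p in exclude)

print(matches_rule("SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.10"))  # True: version banner
print(matches_rule("HTTP/1.1 200 OK"))                           # False: excluded
```

The two exclusions show why the rule's false-positive filters matter: HTTP response lines and MIME headers contain digit-dot-digit sequences that look like software versions but are not.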
SpiderFoot's correlation engine allows you to define rules in YAML to analyze and relate collected OSINT data. The engine and rule storage are fully backend-agnostic and robust.
- Correlation rules are written in YAML and stored in the `/correlations` directory.
- The engine supports advanced analysis methods (`outlier`, `first_collection_only`, `match_all_to_first_collection`).
- Rule loading is robust: syntax errors in a rule will not prevent other rules from loading.
- The `id` field in each rule must match the filename.
- Correlation results are stored in the database and can be queried via the API or web UI.
- Use the provided template and reference built-in rules for guidance.
- See `/correlations/README.md` for a full technical reference.
- Correlation results and configuration are stored in PostgreSQL.
SpiderFoot's correlation engine is designed for reliability, extensibility, and backend-agnostic operation. Here are key technical details and best practices for advanced users and developers:
- All correlation rules are stored as YAML files in the `/correlations` directory.
- Rules are loaded at startup; syntax errors in one rule will not prevent others from loading. Errors are logged with full context.
- The `id` field in each rule must match the filename (excluding `.yaml`). This ensures traceability and prevents accidental rule duplication.
- Rules can be enabled/disabled by adding/removing them from the directory or via the web UI (if supported).
- Correlation results, rule metadata, and configuration are stored in the main SpiderFoot PostgreSQL database.
- All upsert/replace operations use PostgreSQL-native helpers for correct behavior.
- Schema creation and migrations are idempotent. Unique constraints and composite keys are enforced where required.
- The correlation engine is robust to schema changes and will automatically migrate or update tables as needed.
- Correlation results are available in the web UI, via the REST API, and can be exported for further analysis.
- Results include references to the rule ID, scan instance, affected entities, and a human-readable headline.
- You can use the API to filter, search, and aggregate correlation results for reporting or integration with other tools.
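As a sketch of querying correlation results programmatically, the helper below builds a query URL. The `/scancorrelations` path and the parameter names are assumptions for illustration; check your SpiderFoot instance's API documentation for the actual endpoint:

```python
from urllib.parse import urlencode

def correlation_results_url(base_url, scan_id, risk=None):
    """Build a URL for fetching a scan's correlation results.

    The '/scancorrelations' path and parameter names are illustrative
    assumptions, not a documented SpiderFoot API contract.
    """
    params = {"id": scan_id}
    if risk is not None:
        params["risk"] = risk  # e.g. filter to HIGH-risk correlations only
    return f"{base_url}/scancorrelations?{urlencode(params)}"

print(correlation_results_url("http://127.0.0.1:5001", "94EB5F0B", risk="HIGH"))
# http://127.0.0.1:5001/scancorrelations?id=94EB5F0B&risk=HIGH
```

From there, the JSON response could be filtered or aggregated for reporting or integration with other tools.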
- Start with the provided template and reference built-in rules for best practices.
- Use the `collections` section to extract relevant data, `aggregation` to group, and `analysis` for advanced logic.
- Test new rules on sample scans and review the logs for errors or unexpected results.
- See `/correlations/README.md` for a full technical reference and advanced features.
- All correlation engine storage and queries use PostgreSQL. Placeholders, upserts, and type mapping are handled natively.
- Connection pooling is recommended for high concurrency.
- Always back up your database before adding or modifying correlation rules in production.
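The PostgreSQL-native upserts mentioned above are typically expressed with `INSERT ... ON CONFLICT`. The statement below is a sketch of that pattern only; the table and column names are assumptions, not SpiderFoot's actual schema:

```python
# Illustrative PostgreSQL upsert; table/column names are assumptions.
UPSERT_CORRELATION = """
INSERT INTO tbl_scan_correlation_results (scan_instance_id, rule_id, headline)
VALUES (%s, %s, %s)
ON CONFLICT (scan_instance_id, rule_id, headline)
DO UPDATE SET headline = EXCLUDED.headline;
"""

# A driver such as psycopg2 would execute it with parameters, e.g.:
# cur.execute(UPSERT_CORRELATION, (scan_id, "open_port_version", headline))
print("ON CONFLICT" in UPSERT_CORRELATION)  # True
```

The `ON CONFLICT` clause requires a matching unique or composite key, which is why the schema notes above call out enforced unique constraints.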
- If a rule fails to load, check the logs for detailed error messages (including YAML syntax and schema issues).
- If correlation results are missing, ensure your scan data contains the required event types and fields referenced by your rules.
- For database errors, see the Configuration Guide for backend-specific troubleshooting.
For more advanced usage, see the Correlation Analysis Guide and the Developer Guide.