ARCHITECTURE - poppopjmp/spiderfoot GitHub Wiki
SpiderFoot v6.0.0 implements a microservices-only architecture:
- 23 containers behind a Traefik v3 reverse proxy with full observability, AI agents, Celery task processing, and React SPA frontend
- Native async I/O via aiohttp + aiodns for high-throughput scanner modules
- AI structured outputs via Pydantic schemas and OpenAI json_schema mode
- TypeScript SDK auto-generated from the OpenAPI spec via @hey-api/openapi-ts
- PEP 561 strict typing with mypy enforcement across the Python codebase
Note: Monolith mode was removed in v6.0.0 (Batches 34–36). SpiderFoot now requires Docker Compose or Kubernetes.
┌──────────────────────────────────────────────────────────────────────────┐
│ Traefik v3 Gateway (:443)                                                │
│ Auto-TLS · Rate limiting · Reverse proxy · Path routing                  │
├──────────────┬──────────────┬──────────────┬───────────────┬─────────────┤
│ Frontend     │ REST API     │ Agents       │ Celery Flower │ Grafana     │
│ React SPA    │ :8001        │ :8100        │ :5555         │ :3000       │
├──────────────┴──────────────┴──────────────┴───────────────┴─────────────┤
│ Task Processing                                                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐                          │
│  │ Celery     │  │ Celery     │  │ Apache     │                          │
│  │ Worker     │  │ Beat       │  │ Tika :9998 │                          │
│  │ (async)    │  │ (scheduler)│  │ (document) │                          │
│  └────────────┘  └────────────┘  └────────────┘                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Data Layer                                                               │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐          │
│  │ PostgreSQL │  │ Redis      │  │ Qdrant     │  │ MinIO      │          │
│  │ :5432      │  │ :6379      │  │ :6333      │  │ :9000/9001 │          │
│  │ (Primary)  │  │ (EventBus, │  │ (Vector    │  │ (S3 Object │          │
│  │            │  │  Cache,    │  │  Search)   │  │  Storage)  │          │
│  │            │  │  Broker)   │  │            │  │            │          │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘          │
├──────────────────────────────────────────────────────────────────────────┤
│ LLM Gateway                                                              │
│  ┌────────────┐                                                          │
│  │ LiteLLM    │  Multi-provider proxy (OpenAI, Anthropic, Ollama)        │
│  │ :4000      │  Cost tracking · Model routing · Redis cache             │
│  └────────────┘                                                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Observability Pipeline                                                   │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐          │
│  │ Vector.dev │  │ Loki       │  │ Prometheus │  │ Jaeger     │          │
│  │ :8686      │  │ :3100      │  │ :9090      │  │ :16686     │          │
│  │ :4317/4318 │  │ (Log aggr) │  │ (Metrics)  │  │ (Tracing)  │          │
│  │ :9598      │  │            │  │            │  │            │          │
│  │ (Telemetry │  └────────────┘  └────────────┘  └────────────┘          │
│  │  pipeline) │  ┌────────────┐                                          │
│  └────────────┘  │ Grafana    │  Dashboards · Alerting · Data explorer   │
│                  │ :3000      │                                          │
│                  └────────────┘                                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Sidecars & Infrastructure                                                │
│  ┌────────────┐  ┌────────────┐  ┌─────────────────┐                     │
│  │ pg-backup  │  │ minio-init │  │ Docker Socket   │                     │
│  │ (cron)     │  │ (one-shot) │  │ Proxy (Traefik) │                     │
│  │ → MinIO    │  │            │  │                 │                     │
│  └────────────┘  └────────────┘  └─────────────────┘                     │
└──────────────────────────────────────────────────────────────────────────┘
The spiderfoot/ package is organized into 33 domain sub-packages (listed in the table below) plus several top-level modules:
| Sub-package | Purpose | Key modules |
|---|---|---|
| spiderfoot/agents/ | AI analysis agents (LLM-powered) | base, finding_validator, credential_analyzer, text_summarizer, report_generator, document_analyzer, threat_intel, service |
| spiderfoot/ai/ | AI model abstraction and helpers | LLM client wrappers, prompt templates |
| spiderfoot/api/ | FastAPI application and routers | REST endpoints, GraphQL, middleware |
| spiderfoot/auth/ | Authentication and authorization | JWT, API key, RBAC, session management |
| spiderfoot/config/ | Configuration management | constants, app_config, config_schema |
| spiderfoot/core/ | Core engine and orchestration | Engine, scheduler, coordinator |
| spiderfoot/correlation/ | Correlation rule engine | rule_executor, rule_loader, YAML rule processing |
| spiderfoot/data_service/ | Database abstraction layer | Local/HTTP/gRPC backends for scans, events |
| spiderfoot/db/ | PostgreSQL data access layer | Models, queries, migrations |
| spiderfoot/dicts/ | Data dictionaries and taxonomies | Event type mappings, category lists |
| spiderfoot/ecosystem/ | Ecosystem and plugin marketplace | Registry, versioning, discovery |
| spiderfoot/enrichment/ | Document enrichment pipeline | converter, extractor, pipeline, service |
| spiderfoot/eventbus/ | Pub/sub messaging | Memory/Redis/NATS backends |
| spiderfoot/events/ | Event types and processing | event, event_relay, event_dedup, event_pipeline, event_taxonomy |
| spiderfoot/export/ | Export service (JSON/CSV/STIX/SARIF) | export_service, format adapters |
| spiderfoot/iac/ | Infrastructure-as-Code mapping | IaC resource detection and visualization |
| spiderfoot/module_mgmt/ | Module lifecycle management | Loader, registry, resolver (distinct from plugins/) |
| spiderfoot/notifications/ | Notification service | Slack, webhook, email dispatchers |
| spiderfoot/observability/ | Logging, metrics, auditing | logger, metrics, structured_logging, audit_log, health, tracing |
| spiderfoot/ops/ | Operational utilities | Health checks, maintenance tasks |
| spiderfoot/plugins/ | Plugin base classes | plugin, modern_plugin |
| spiderfoot/recon/ | Reconnaissance utilities | Active recon helpers and tool wrappers |
| spiderfoot/reporting/ | Report generation | report_generator, export_service, report_formatter, visualization_service |
| spiderfoot/research/ | Research and investigation helpers | Research workflow utilities |
| spiderfoot/scan/ | Scan lifecycle and orchestration | scan_state, scan_coordinator, scan_scheduler, scan_queue, scan_workflow |
| spiderfoot/scan_service/ | Scan service layer | Service wrapper around scan orchestration |
| spiderfoot/security/ | Authentication, CSRF, middleware | auth, csrf_protection, security_middleware, security_logging |
| spiderfoot/services/ | External service integrations | cache_service, dns_service, http_service, grpc_service, websocket_service, embedding_service |
| spiderfoot/sflib/ | Legacy library shim | Backward-compat wrappers for old SpiderFoot god object |
| spiderfoot/storage/ | Storage backends | MinIO S3 client, file storage |
| spiderfoot/tasks/ | Celery task definitions | Async scan tasks, scheduled jobs |
| spiderfoot/user_input/ | User-defined input ingestion | service |
| spiderfoot/webhooks/ | Webhook handlers | Inbound/outbound webhook processing |
Top-level modules: celery_app.py, helpers.py, result_cache.py, retry.py, service_integration.py, service_registry.py, service_runner.py, target.py, threadpool.py, workspace.py
# Preferred: import from subpackage init
from spiderfoot.events import SpiderFootEvent
from spiderfoot.plugins import SpiderFootPlugin
from spiderfoot.config import SF_DATA_TYPES
# Also valid: import from specific module
from spiderfoot.events.event import SpiderFootEvent
from spiderfoot.scan.scan_state import SpiderFootScanState
# Top-level re-exports still work
from spiderfoot import SpiderFootEvent, SpiderFootPlugin

Note (v5.245.0): All backward-compatibility shim files in the spiderfoot/ root were removed. Code that used old paths like from spiderfoot.event import ... or from spiderfoot.plugin import ... must update to the subpackage paths shown above.
┌──────────────────────────────────────────────────────────────────────────┐
│ Traefik v3 Gateway (:443)                                                │
│ Auto-TLS · Rate limiting · Reverse proxy                                 │
├──────────────┬──────────────┬──────────────┬───────────────┬─────────────┤
│ Frontend     │ REST API     │ Agents       │ Celery        │ Tika        │
│ React SPA    │ :8001        │ :8100        │ Workers/Beat  │ :9998       │
├──────────────┴──────────────┴──────────────┴───────────────┴─────────────┤
│ Service Layer                                                            │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐          │
│  │ServiceReg. │  │ConfigSvc   │  │ Metrics    │  │ Structured │          │
│  │(DI Contain)│  │(env+file)  │  │(Prometheus)│  │ Logging    │          │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘          │
├──────────────────────────────────────────────────────────────────────────┤
│ Core Services                                                            │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐          │
│  │HttpService │  │ DnsService │  │CacheService│  │DataService │          │
│  │(pooled HTTP│  │(DNS+cache) │  │(Mem/File/  │  │(PostgreSQL/│          │
│  │ +proxy)    │  │            │  │ Redis)     │  │ gRPC)      │          │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘          │
├──────────────────────────────────────────────────────────────────────────┤
│ Execution Layer                                                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐          │
│  │ WorkerPool │  │ Scan       │  │Correlation │  │ EventBus   │          │
│  │(Thread/Proc│  │ Scheduler  │  │ Service    │  │(Mem/Redis/ │          │
│  │ pool)      │  │(priority Q)│  │(auto+batch)│  │ NATS)      │          │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘          │
├──────────────────────────────────────────────────────────────────────────┤
│ LLM Gateway                                                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐                          │
│  │ LiteLLM    │  │ AI Agents  │  │ OTel       │                          │
│  │(multi-LLM  │  │ (6 agents) │  │ Tracing    │                          │
│  │ proxy)     │  │            │  │ (Vector →  │                          │
│  └────────────┘  └────────────┘  │  Jaeger)   │                          │
│                                  └────────────┘                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Data Pipeline                                                            │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐          │
│  │Vector.dev  │  │ Loki       │  │ Prometheus │  │ Grafana    │          │
│  │(logs/traces│  │ (log aggr) │  │ (metrics)  │  │(dashboards)│          │
│  │ /metrics)  │  │            │  │            │  │            │          │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘          │
├──────────────────────────────────────────────────────────────────────────┤
│ Module Layer                                                             │
│  ┌──────────────────────┐  ┌──────────────────────┐                      │
│  │ SpiderFootPlugin     │  │SpiderFootModernPlugin│                      │
│  │ (legacy, 309         │  │(service-aware, new   │                      │
│  │  modules)            │  │ modules)             │                      │
│  └──────────────────────┘  └──────────────────────┘                      │
└──────────────────────────────────────────────────────────────────────────┘
Abstracted publish/subscribe messaging with three backends:
- Memory (default): In-process asyncio.Queue-based, zero config
- Redis Streams: Consumer groups, at-least-once delivery
- NATS JetStream: High-throughput, durable messaging
Topics follow a dot-notation pattern: scan.started, scan.completed,
module.event.new, etc.
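To illustrate the Memory backend idea, here is a minimal sketch of an in-process bus built on asyncio.Queue with dot-notation topics. The class name MemoryEventBus and its methods are hypothetical and are not taken from the spiderfoot.eventbus package; wildcard topics and delivery guarantees are deliberately left out.

```python
import asyncio
from collections import defaultdict


class MemoryEventBus:
    """Toy in-process bus (hypothetical): one asyncio.Queue per subscriber."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[asyncio.Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> asyncio.Queue:
        """Register a new queue for an exact topic name such as 'scan.started'."""
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[topic].append(queue)
        return queue

    async def publish(self, topic: str, payload: dict) -> int:
        """Put the payload on every subscriber queue; return the delivery count."""
        queues = self._subscribers.get(topic, [])
        for queue in queues:
            await queue.put(payload)
        return len(queues)


async def demo() -> tuple[dict, int]:
    bus = MemoryEventBus()
    inbox = bus.subscribe("scan.started")
    await bus.publish("scan.started", {"scan_id": "abc123"})
    # Nobody subscribed to this topic, so nothing is delivered.
    undelivered = await bus.publish("scan.completed", {"scan_id": "abc123"})
    return await inbox.get(), undelivered


received, undelivered_count = asyncio.run(demo())
```

A Redis Streams or NATS JetStream backend would presumably keep the same publish/subscribe surface and swap the queue for a durable transport.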
Database abstraction layer supporting:
- Local: Direct PostgreSQL via SpiderFootDb
- HTTP: Remote REST API for microservices mode
- gRPC: High-performance binary protocol
Provides CRUD for scans, events, configs, and correlations.
Extracted HTTP client from the legacy SpiderFoot god object:
- Connection pooling
- SOCKS4/5 and HTTP proxy support
- Configurable timeouts and user-agents
- TLS certificate parsing
- Google/Bing API pagination helpers
Extracted DNS resolver with built-in TTL cache:
- A/AAAA/MX/TXT/NS/PTR record types
- Wildcard detection
- Zone transfer checks
- Configurable resolvers
Three-tier caching:
- Memory: LRU eviction, TTL-based expiry
- File: SHA-224 hashed filenames, persistent
- Redis: Distributed cache for microservices
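As a rough sketch of the first two tiers, the snippet below combines LRU eviction with TTL expiry for the memory tier and derives a SHA-224 filename for the file tier. MemoryCache and cache_filename are illustrative names, not the actual CacheService API, and the injectable clock exists only to make the behavior easy to check.

```python
import hashlib
import time
from collections import OrderedDict


class MemoryCache:
    """Hypothetical memory tier: LRU eviction plus per-entry TTL."""

    def __init__(self, max_entries=128, clock=time.monotonic):
        self._max = max_entries
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def put(self, key, value, ttl):
        self._data[key] = (self._clock() + ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the least recently used entry

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return default
        self._data.move_to_end(key)  # mark as recently used
        return value


def cache_filename(key: str) -> str:
    """File tier: a stable, filesystem-safe name from a SHA-224 digest."""
    return hashlib.sha224(key.encode("utf-8")).hexdigest()
```

The Redis tier would typically delegate expiry to the server rather than track it client-side.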
Centralized configuration management:
- 40+ environment variable mappings
- Type coercion (bool, int, float, str)
- Validation rules (min/max, choices, required)
- Hot-reload with watcher callbacks
- Snapshot isolation for per-scan configs
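The type coercion and validation rules listed above could look roughly like the following sketch. The Setting dataclass, the load_setting helper, and the SF_DEBUG / SF_MAX_SCANS variable names are assumptions made for illustration; the real ConfigService schema may differ.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Sequence

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


@dataclass(frozen=True)
class Setting:
    """Hypothetical description of one environment-backed option."""
    env_var: str
    kind: type = str
    default: Any = None
    required: bool = False
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    choices: Optional[Sequence[Any]] = None


def coerce(raw: str, kind: type) -> Any:
    """Convert a raw environment string to bool, int, float, or str."""
    if kind is bool:
        lowered = raw.strip().lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    return kind(raw)


def load_setting(setting: Setting, env: Mapping[str, str]) -> Any:
    """Read, coerce, and validate one setting from an environment mapping."""
    raw = env.get(setting.env_var)
    if raw is None:
        if setting.required:
            raise KeyError(f"{setting.env_var} is required")
        return setting.default
    value = coerce(raw, setting.kind)
    if setting.minimum is not None and value < setting.minimum:
        raise ValueError(f"{setting.env_var} below minimum {setting.minimum}")
    if setting.maximum is not None and value > setting.maximum:
        raise ValueError(f"{setting.env_var} above maximum {setting.maximum}")
    if setting.choices is not None and value not in setting.choices:
        raise ValueError(f"{setting.env_var} must be one of {list(setting.choices)}")
    return value
```

In practice the env argument would be os.environ; passing a plain dict keeps the sketch testable.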
Dependency injection container:
- Lazy factory pattern (services created on first access)
- Well-known service constants
- Thread-safe singleton
- ServiceMixin for convenient property access
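A lazy, thread-safe registry of this kind can be sketched in a few lines. The class below is a simplified stand-in written for illustration; the method names register and get and the "http" service key are assumptions, not the documented ServiceRegistry interface.

```python
import threading
from typing import Any, Callable, Dict


class LazyRegistry:
    """Hypothetical DI container: factories run once, on first access."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Any]] = {}
        self._instances: Dict[str, Any] = {}
        self._lock = threading.RLock()

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        with self._lock:
            self._factories[name] = factory
            self._instances.pop(name, None)  # force re-creation after re-register

    def get(self, name: str) -> Any:
        with self._lock:
            if name not in self._instances:
                self._instances[name] = self._factories[name]()
            return self._instances[name]


calls = []
registry = LazyRegistry()
# Registering does not build the service; the factory is only stored.
registry.register("http", lambda: calls.append("built") or object())
first = registry.get("http")
second = registry.get("http")
```

Holding the lock while the factory runs keeps the sketch simple at the cost of serializing service creation.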
Module execution infrastructure:
- Thread pool (default) or process pool strategies
- Per-module workers with event queues
- Health monitoring with automatic restart
- Graceful shutdown with drain
Scan lifecycle management:
- Priority queue (CRITICAL > HIGH > NORMAL > LOW)
- Concurrent scan limiting
- Pause/resume/abort controls
- Timeout detection
- Progress tracking
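The priority ordering and concurrency limit can be illustrated with a heap-based queue. This is a sketch under assumptions: Priority, ScanQueue, and their methods are invented names, and pause/resume, timeouts, and progress tracking are omitted.

```python
import heapq
import itertools
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3


class ScanQueue:
    """Hypothetical scheduler core: priority heap plus a running-scan cap."""

    def __init__(self, max_concurrent: int = 2) -> None:
        self._heap: list = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority
        self._running: set = set()
        self._max = max_concurrent

    def submit(self, scan_id: str, priority: Priority = Priority.NORMAL) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), scan_id))

    def start_next(self) -> Optional[str]:
        """Start the highest-priority pending scan, or return None if blocked."""
        if not self._heap or len(self._running) >= self._max:
            return None
        _, _, scan_id = heapq.heappop(self._heap)
        self._running.add(scan_id)
        return scan_id

    def finish(self, scan_id: str) -> None:
        self._running.discard(scan_id)
```

Lower enum values sort first, so CRITICAL scans always leave the heap before LOW ones.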
Standalone correlation engine:
- Wraps the existing RuleExecutor pipeline
- EventBus subscription for auto-trigger on scan completion
- Async queue for batch processing
- Result caching and callbacks
- Prometheus metrics integration
Unified request routing:
- Circuit breaker per downstream service
- Token-bucket rate limiting per client
- Microservices-only mode (monolith removed in v6.0.0)
- FastAPI router integration
- System status aggregation
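Token-bucket rate limiting, as mentioned above, refills a per-client budget at a fixed rate and rejects requests once the budget is spent. The sketch below shows the general technique only; the TokenBucket class and its parameters are illustrative and do not describe the gateway's actual implementation.

```python
import time


class TokenBucket:
    """Generic token bucket: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A per-client limiter would keep one bucket per client identifier, for example in a dict keyed by API key.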
Zero-dependency Prometheus-compatible instrumentation:
- Counter, Gauge, Histogram types
- Label support for dimensional metrics
- /metrics endpoint in Prometheus text format
- Pre-defined metrics for scans, events, HTTP, DNS, cache
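For readers unfamiliar with the Prometheus text format, the following dependency-free sketch renders a labeled counter the way a /metrics endpoint would expose it. The Counter class and the metric name sf_scans_total are hypothetical, and label-value escaping is omitted for brevity.

```python
class Counter:
    """Minimal labeled counter rendered in Prometheus text exposition format."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: dict = {}  # sorted label tuple -> running total

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


scans = Counter("sf_scans_total", "Total scans by final status")
scans.inc(status="done")
scans.inc(status="done")
scans.inc(status="failed")
metrics_text = scans.render()
```

Gauge and Histogram types follow the same pattern with a different TYPE line and, for histograms, bucket, sum, and count series.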
Data pipeline to external systems:
- Batched HTTP posting to Vector.dev
- Separate channels for events, logs, metrics
- Configurable transforms and routing
- Elasticsearch/S3/file sinks via Vector config
Inter-service communication:
- Protobuf service contracts (proto/spiderfoot.proto)
- JSON-over-HTTP fallback when gRPC unavailable
- ServiceDirectory for environment-based endpoint discovery
- Health checking
The docker-compose.yml defines 23 containers across 8 profiles:
| Container | Profile | Image | Purpose |
|---|---|---|---|
| sf-postgres | (core) | postgres:15-alpine | Primary database (:5432) |
| sf-redis | (core) | redis:7-alpine | Event bus + cache + Celery broker (:6379) |
| sf-api | (core) | spiderfoot-micro | REST API + GraphQL (:8001) |
| sf-celery-worker | (core) | spiderfoot-micro | Celery distributed task workers |
| sf-frontend-ui | (core) | spiderfoot-frontend | React SPA served by Nginx (:80) |
| sf-celery-worker-active | scan | spiderfoot-active | Active recon worker (nmap, nuclei, httpx, …) |
| sf-traefik | proxy | traefik:v3 | Reverse proxy + auto-TLS + routing (:443) |
| sf-docker-proxy | proxy | tecnativa/docker-socket-proxy | Secure Docker API access for Traefik |
| sf-minio | storage | minio/minio | S3-compatible object storage (:9000/9001) |
| sf-minio-init | storage | minio/mc | One-shot bucket provisioner |
| sf-qdrant | storage | qdrant/qdrant | Vector similarity search (:6333) |
| sf-tika | storage | apache/tika | Document parsing: PDF, DOCX, XLSX (:9998) |
| sf-pg-backup | storage | postgres:15-alpine | Scheduled PG backup → MinIO |
| sf-vector | monitor | timberio/vector | Telemetry pipeline (:8686/:4317/:9598) |
| sf-loki | monitor | grafana/loki | Log aggregation (:3100) |
| sf-grafana | monitor | grafana/grafana | Dashboards & visualization (:3000) |
| sf-prometheus | monitor | prom/prometheus | Metrics collection (:9090) |
| sf-jaeger | monitor | jaegertracing/jaeger | Distributed tracing (:16686) |
| sf-agents | ai | spiderfoot-micro | AI analysis agents: 6 agents (:8100) |
| sf-litellm | ai | ghcr.io/berriai/litellm | Unified LLM gateway (:4000) |
| sf-celery-beat | scheduler | spiderfoot-micro | Celery periodic task scheduler |
| sf-flower | scheduler | spiderfoot-micro | Celery monitoring dashboard (:5555) |
| sf-keycloak | sso | keycloak | OIDC / SAML identity provider (:9080) |
- sf-frontend: Traefik → Frontend/API (external-facing)
- sf-backend: All services → PostgreSQL/Redis/Qdrant/MinIO (internal)
| Volume | Container | Mount Path |
|---|---|---|
| sf-postgres-data | sf-postgres | /var/lib/postgresql/data |
| sf-redis-data | sf-redis | /data |
| sf-qdrant-data | sf-qdrant | /qdrant/storage |
| sf-qdrant-snapshots | sf-qdrant | /qdrant/snapshots |
| sf-vector-data | sf-vector | /var/lib/vector |
| sf-minio-data | sf-minio | /data |
| sf-logs | sf-api | /app/logs |
| traefik-logs | sf-traefik | /var/log/traefik |
The web interface is a modern React SPA built with TypeScript, Vite, and Tailwind CSS. It features a dark theme with cyan accents, responsive layout, and real-time scan updates via GraphQL subscriptions. All HTML injection points are sanitized with DOMPurify, and ESLint with typescript-eslint enforces strict type safety (zero any types).
Key pages and features:
| Page | Description |
|---|---|
| Dashboard | Active scans, event totals, risk distribution, recent activity |
| New Scan | Target input, module category selection, scan configuration |
| Scans | Paginated scan list with status, target, events, duration |
| Scan Detail | 8-tab view: Summary, Browse, Graph, GeoMap, Correlations, AI Report, Scan Settings, Log |
| Workspaces | Multi-scan campaign management with notes and analytics |
| Settings | Global settings, module API keys, notification preferences |
| Agents | AI agent status monitoring and analysis results |
| Users / SSO | User management and SSO configuration (OIDC/SAML) |
| API Keys | API key management for programmatic access |
Scan Detail component architecture: The Scan Detail page is built from 10 focused tab components in frontend/src/components/scan/, each owning its own state, queries, and rendering. The parent ScanDetail.tsx is a ~130-line shell that handles routing and tab navigation. Shared utilities live in frontend/src/lib/ (geo constants, HTML sanitization, error handling).



Six LLM-powered analysis agents that subscribe to Redis event bus
events and produce structured intelligence. All agents extend BaseAgent
with concurrency control, timeout handling, and Prometheus metrics.
| Agent | Event Types | Output |
|---|---|---|
| FindingValidator | `MALICIOUS_*`, `VULNERABILITY_*`, `LEAKED_*` | verdict, confidence, remediation |
| CredentialAnalyzer | `LEAKED_CREDENTIALS`, `PASSWORD_*`, `API_KEY_*` | severity, is_active, risk_factors |
| TextSummarizer | `RAW_*`, `TARGET_WEB_CONTENT`, `PASTE_*`, `DOCUMENT_*` | summary, entities, sentiment |
| ReportGenerator | `SCAN_COMPLETE`, `REPORT_REQUEST` | executive_summary, threat_assessment |
| DocumentAnalyzer | `DOCUMENT_UPLOAD`, `USER_DOCUMENT`, `REPORT_UPLOAD` | entities, IOCs, classification |
| ThreatIntelAnalyzer | `MALICIOUS_*`, `BLACKLISTED_*`, `CVE_*`, `DARKNET_*` | MITRE ATT&CK mapping, threat actors |
All agents route LLM calls through LiteLLM (:4000) for unified model selection, cost tracking, and provider failover.
| Method | Endpoint | Description |
|---|---|---|
| POST | /agents/process | Process a single event through matching agents |
| POST | /agents/analyze | Deep analysis of a specific finding |
| POST | /agents/report | Generate a comprehensive scan report |
| GET | /agents/status | Status of all agents and pending tasks |
| GET | /metrics | Prometheus metrics |
| GET | /health | Health check |
Converts documents to text, extracts entities and IOCs, and stores results in MinIO.
Supported formats: PDF (pypdf), DOCX (python-docx), XLSX (openpyxl), HTML, RTF (striprtf), plain text. Optional Apache Tika fallback for complex documents.
Entity extraction: pre-compiled regex patterns for IPv4/IPv6, emails, URLs, domains, MD5/SHA1/SHA256 hashes, phone numbers, CVEs, Bitcoin/Ethereum addresses, AWS keys, credit cards. Smart deduplication and private IP filtering.
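The extraction step can be pictured with a small sketch covering two of the listed pattern families (IPv4 and CVE) along with deduplication and private-IP filtering. The regexes and the extract_iocs function are simplified assumptions, not the patterns shipped in spiderfoot/enrichment/.

```python
import ipaddress
import re

# Compiled once at import time, mirroring the "pre-compiled" approach.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_iocs(text: str) -> dict:
    """Return deduplicated public IPv4 addresses and CVE identifiers."""
    ips = []
    # dict.fromkeys removes duplicates while preserving first-seen order.
    for candidate in dict.fromkeys(IPV4_RE.findall(text)):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. 999.1.1.1 matches the regex but is not a valid IP
        if addr.is_private or addr.is_loopback:
            continue  # drop RFC 1918 and loopback addresses
        ips.append(candidate)
    cves = list(dict.fromkeys(match.upper() for match in CVE_RE.findall(text)))
    return {"ipv4": ips, "cve": cves}


sample = "Seen 8.8.8.8 and 10.0.0.5, again 8.8.8.8; cve-2021-44228 and 999.1.1.1"
iocs = extract_iocs(sample)
```

Validating regex candidates with the ipaddress module avoids encoding octet ranges in the pattern itself.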
| Method | Endpoint | Description |
|---|---|---|
| POST | /enrichment/upload | Upload and process a document (100MB limit) |
| POST | /enrichment/process-text | Process raw text content |
| POST | /enrichment/batch | Batch process multiple documents |
| GET | /enrichment/results/{id} | Fetch enrichment results |
| GET | /enrichment/results | List all enrichment results |
| GET | /metrics | Prometheus metrics |
| GET | /health | Health check |
Allows users to supply their own documents, IOCs, reports, and context data to augment automated OSINT collection.
| Method | Endpoint | Description |
|---|---|---|
| POST | /input/document | Upload document → enrichment → agent analysis |
| POST | /input/iocs | Submit IOC list with deduplication |
| POST | /input/report | Submit structured report → entity extraction |
| POST | /input/context | Set scope/exclusions/threat model for a scan |
| POST | /input/targets | Batch target list for multi-scan |
| GET | /input/submissions | List all submissions |
| GET | /input/submissions/{id} | Get submission details |
Unified proxy supporting 100+ LLM providers through an
OpenAI-compatible API. Configuration in infra/litellm/config.yaml.
| Model Alias | Provider | Purpose |
|---|---|---|
| gpt-4o | OpenAI | Complex analysis (reports, threat intel) |
| gpt-4o-mini | OpenAI | Default for most agents |
| gpt-3.5-turbo | OpenAI | Fast, low-cost tasks |
| claude-sonnet | Anthropic | Alternative for complex reasoning |
| claude-haiku | Anthropic | Fast Anthropic alternative |
| ollama/llama3 | Ollama (local) | Self-hosted, no API key needed |
| ollama/mistral | Ollama (local) | Self-hosted coding/analysis |
- default → gpt-4o-mini
- fast → gpt-3.5-turbo
- smart → gpt-4o
- local → ollama/llama3
Vector.dev replaces both Promtail and OpenTelemetry Collector as a unified telemetry pipeline:
- Logs: Docker container logs → JSON parse → route by level → Loki + MinIO
- Events: HTTP source (:8686) → enrich with category/risk → MinIO + file
- Metrics: Internal metrics → Prometheus exporter (:9598)
- Traces: OTLP receiver (:4317 gRPC, :4318 HTTP) → forward to Jaeger
Five pre-provisioned dashboards in infra/grafana/dashboards/:
| Dashboard | Panels | Focus |
|---|---|---|
| SpiderFoot - Platform Overview | 19 | Scan counts, event rates, risk distribution, API latency |
| SpiderFoot - Scan Operations | 22 | Active scans, module execution, queue depths, throughput |
| SpiderFoot - Celery Task Queue | 16 | Task success/failure rates, queue lengths, worker concurrency |
| SpiderFoot - Infrastructure | 22 | Container CPU/memory, PostgreSQL, Redis, Qdrant, MinIO health |
| SpiderFoot - Service Logs | 17 | Structured log explorer, error rates, log volume per service |
spiderfoot-api, spiderfoot-scanner, spiderfoot-agents, spiderfoot-enrichment, vector, qdrant, minio, jaeger, litellm, prometheus (self-monitoring).
OpenTelemetry instrumentation via spiderfoot/observability/tracing.py
with graceful no-op fallback when the SDK is not installed. Traces flow:
Service → OTLP → Vector.dev :4317 → Jaeger :4317.
| Bucket | Purpose |
|---|---|
| sf-logs | Vector.dev archived logs and events |
| sf-reports | Generated scan reports |
| sf-pg-backups | PostgreSQL backups |
| sf-qdrant-snapshots | Vector DB snapshots |
| sf-data | General application data |
| sf-loki-data | Loki chunk/index storage |
| sf-loki-ruler | Loki ruler data |
| sf-enrichment | Enrichment pipeline documents |
Custom HTTP-based Qdrant client (NOT the PyPI qdrant-client). Communicates
with Qdrant via urllib.request REST calls.
- Singleton via get_qdrant_client() / init_qdrant_client()
- Backends: MemoryVectorBackend (testing), HttpVectorBackend (production)
- Collection prefix: sf_ (configurable via SF_QDRANT_PREFIX)
- Key classes: VectorPoint(id, vector, payload, score), SearchResult(points, query_time_ms, total_found), Filter(must, must_not, should) with match() / range() statics, CollectionInfo(name, vector_size, distance, point_count)
- Methods: ensure_collection, search, upsert, get, delete, scroll, count, collection_info, list_collections
Generates vector embeddings for text data:
- Providers: MOCK (default), SENTENCE_TRANSFORMER, OPENAI, HUGGINGFACE
- Default model: all-MiniLM-L6-v2 (384 dimensions)
- Methods: embed_text(), embed_texts() with caching and batching
5 correlation strategies over vectorized scan events:
| Strategy | Description |
|---|---|
| SIMILARITY | Cosine similarity within a scan |
| CROSS_SCAN | Similar events across different scans |
| TEMPORAL | Time-windowed clustering |
| INFRASTRUCTURE | Infrastructure topology grouping |
| MULTI_HOP | Multi-step relationship discovery |
Default collection: sf_osint_events
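The SIMILARITY strategy rests on cosine similarity between event embeddings. The sketch below shows that computation on plain Python lists and a naive pairwise comparison; the function names and the 0.9 threshold are assumptions, and a production system would delegate the search to Qdrant rather than compare every pair.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(
    vectors: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Return event-id pairs whose embeddings meet the similarity threshold."""
    ids = sorted(vectors)
    return [
        (left, right)
        for index, left in enumerate(ids)
        for right in ids[index + 1:]
        if cosine_similarity(vectors[left], vectors[right]) >= threshold
    ]


events = {"e1": [1.0, 0.0], "e2": [2.0, 0.0], "e3": [0.0, 1.0]}
pairs = similar_pairs(events)
```

Because cosine similarity ignores magnitude, e1 and e2 count as identical in direction even though their lengths differ.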
S3-compatible object storage for artifacts, reports, and backups.
| Bucket | Purpose |
|---|---|
| spiderfoot-reports | Generated scan reports (PDF/HTML/MD) |
| spiderfoot-exports | Exported scan data (CSV/JSON/STIX) |
| spiderfoot-artifacts | Raw scan artifacts and screenshots |
| spiderfoot-backups | PostgreSQL pg_dump archives |
| spiderfoot-logs | Archived log files |
- MinioStorageClient: Upload, download, list, delete, presigned URLs
- Singleton: get_minio_client() with automatic bucket creation
- Lifecycle: Configurable retention policies per bucket
Runs in the sf-pg-backup container:
- Hourly pg_dump of the SpiderFoot database
- Compressed archives uploaded to the spiderfoot-backups bucket
- Configurable retention (default: 7 days)
- Health check via backup recency validation
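The retention and recency rules above reduce to simple timestamp comparisons. The following sketch shows one way to express them; the function names, the archive filenames, and the two-hour freshness window are assumptions for illustration, not values read from the sf-pg-backup container.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_backups(
    backups: Dict[str, datetime], now: datetime, retention_days: int = 7
) -> List[str]:
    """Names of archives older than the retention window (7 days by default)."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)


def backup_is_fresh(
    backups: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=2),  # assumed window for an hourly job
) -> bool:
    """Health check: the newest archive must be younger than max_age."""
    if not backups:
        return False
    return now - max(backups.values()) <= max_age


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
archives = {
    "old.sql.gz": now - timedelta(days=8),
    "recent.sql.gz": now - timedelta(hours=1),
}
to_delete = expired_backups(archives, now)
healthy = backup_is_fresh(archives, now)
```

Using timezone-aware timestamps avoids off-by-hours errors when the backup host and the checker run in different zones.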
Code-first GraphQL using Strawberry ≥ 0.235.0, mounted at /api/graphql
with GraphiQL IDE.
| Field | Return Type | Description |
|---|---|---|
| scan(id) | ScanType | Single scan by ID |
| scans(page, pageSize) | PaginatedScans | Paginated scan list |
| scanEvents(scanId, filter, pagination) | PaginatedEvents | Filtered events |
| eventSummary(scanId) | [EventTypeSummary] | Event type counts |
| scanCorrelations(scanId) | [CorrelationType] | Correlation hits |
| scanLogs(scanId) | [ScanLogType] | Module execution logs |
| scanStatistics(scanId) | ScanStatistics | Aggregate scan stats |
| scanGraph(scanId, maxNodes) | ScanGraph | D3 graph data |
| eventTypes | [EventTypeInfo] | Available event types |
| workspaces | [WorkspaceType] | Scan workspaces |
| searchEvents(query, scanId) | PaginatedEvents | Full-text search |
| semanticSearch(query, ...) | VectorSearchResult | Qdrant vector search |
| vectorCollections | [VectorCollectionInfo] | Qdrant collections |
| Mutation | Return Type | Description |
|---|---|---|
| startScan(input) | ScanCreateResult | Create and start a scan |
| stopScan(scanId) | MutationResult | Abort a running scan |
| deleteScan(scanId) | MutationResult | Delete scan + data |
| setFalsePositive(input) | FalsePositiveResult | Toggle FP status |
| rerunScan(scanId) | ScanCreateResult | Clone and rerun a scan |
| Subscription | Yields | Description |
|---|---|---|
| scanProgress(scanId, interval) | ScanType | Polls scan status changes |
| scanEventsLive(scanId, interval) | EventType | New events as they appear |
Protocols: graphql-transport-ws, graphql-ws
- QueryDepthLimiter: Max depth = 10 (prevents deeply nested abuse)
- DataLoaders: ScanEventLoader, ScanCorrelationLoader (N+1 prevention)
All 309 existing modules continue to work unchanged. They use self.sf
(the SpiderFoot god object) for HTTP, DNS, and other operations.
New modules can extend SpiderFootModernPlugin to access services directly:
from spiderfoot.plugins.modern_plugin import SpiderFootModernPlugin

class sfp_example(SpiderFootModernPlugin):
    def handleEvent(self, event):
        # Assumption: the incoming event's data is a hostname
        hostname = event.data
        # Modern: uses HttpService with metrics
        res = self.fetch_url("https://api.example.com/lookup")
        # Cache results
        self.cache_put("key", res, ttl=3600)
        # DNS resolution via DnsService
        addrs = self.resolve_host(hostname)

See MODULE_MIGRATION_GUIDE.md for step-by-step migration instructions.
| Version | Change |
|---|---|
| 6.0.0 | Microservices-only architecture: monolith mode removed. 23-container compose stack, profile-based activation (scan, proxy, storage, monitor, ai, scheduler, sso, full). IaC Map feature (spiderfoot/iac/). 309 modules. 95 correlation rules. 5 Grafana dashboards. |
| 5.246.0 | GraphQL mutations (5), subscriptions (2, WebSocket), Qdrant semantic search resolver, query depth limiter, MinIO object storage (5 buckets), PG backup sidecar, complete documentation overhaul |
| 5.245.0 | Complete shim removal: 79 backward-compat files deleted, 470 imports rewritten to 8 domain sub-packages |
| 5.244.0 | Fix circular imports across all 8 sub-packages (relative imports) |
| 5.243.0 | Populate 8 domain sub-packages (events, scan, plugins, config, security, observability, services, reporting) |
| 5.4.0 | EventBus abstraction (Memory/Redis/NATS) |
| 5.4.1 | Structured JSON logging |
| 5.4.2 | Vector.dev integration |
| 5.5.0 | DataService abstraction |
| 5.5.1 | HttpService extraction |
| 5.5.2 | DnsService extraction |
| 5.5.3 | CacheService (Memory/File/Redis) |
| 5.6.0 | ServiceRegistry + dependency injection |
| 5.6.1 | Module WorkerPool |
| 5.6.2 | ScanScheduler |
| 5.7.0 | Docker microservices decomposition |
| 5.7.1 | Prometheus metrics |
| 5.8.0 | SpiderFootModernPlugin base class |
| 5.8.1 | Service integration wiring |
| 5.8.2 | ConfigService with env overrides |
| 5.9.0 | Celery tasks, scan profiles, PostgreSQL reports, light theme |
| 5.9.1 | API Gateway with circuit breaker |
| 5.9.2 | Correlation Service (standalone) |
| 5.10.0 | Module migration samples + guide |
| 5.10.1 | Architecture docs + README overhaul |
| 5.10.2 | K8s health checks (liveness/readiness/startup) |
| 5.11.0 | Modern CLI with subcommands |
| 5.11.1 | Auth middleware (JWT/API-key/Basic + RBAC) |
| 5.12.0 | Export Service (JSON/CSV/STIX/SARIF) |
| 5.12.1 | Module dependency graph + visualization |
| 5.12.2 | Event schema validation (70+ schemas) |
| 5.13.0 | WebSocket real-time event streaming |
| 5.13.1 | Scan profiles/templates (10 built-in) |
| 5.13.2 | Module hot-reload |
| 5.14.0 | Retry/recovery framework + dead-letter queue |
| 5.15.0 | Kubernetes Helm chart |
| 5.15.1 | Plugin marketplace registry |
| 5.15.2 | Rate limiter service (token-bucket/sliding-window) |
| 5.16.0 | CI/CD pipelines (4 GitHub Actions workflows) |
| 5.16.1 | Notification service (Slack/Webhook/Email) |
| 5.16.2 | Audit logging (immutable trail) |
| 5.17.0 | Scan diff/comparison |
| 5.17.1 | Data retention policies |
| 5.17.2 | OpenAPI 3.1 spec generator |
| 5.18.0 | Plugin testing framework |
| 5.18.1 | Distributed scan coordinator |
| 5.18.2 | Performance benchmarking suite |
| 5.19.0 | Secret management (encrypted file backend) |
| 5.19.1 | API versioning framework |
| 5.19.2 | Error telemetry (fingerprinting + alerting) |
| 5.20.0 | Scan queue with backpressure |
| 5.20.1 | Module dependency resolver |
| 5.21.0 | Database migration framework |
| 5.22.0 | Unified structured logging (JSON + correlation) |
| 5.22.1 | Vector.dev pipeline bootstrap + health checks |
| 5.22.2 | LLM Report Preprocessor (chunk / summarize) |
| 5.22.3 | Context window / token budget manager |
| 5.22.4 | OpenAI-compatible LLM client |
| 5.22.5 | Report generator pipeline orchestrator |
| 5.22.6 | Multi-format report renderer (PDF/HTML/MD/JSON) |
| 5.22.7 | Report REST API |
| 5.22.8 | Report storage engine (PostgreSQL + LRU) |
| 5.22.9 | Module Registry (discovery, dependency, categories) |
| 5.23.0 | EventBus Hardening (DLQ, circuit breaker, retry) |
| 5.23.1 | Wire ReportStore into API layer |
| 5.23.2 | Typed AppConfig (11 dataclass sections, validation) |
| 5.23.3 | Health Check API (7 endpoints, 6 subsystem probes) |
| 5.23.4 | Scan Progress API (SSE streaming) |
| 5.23.5 | Task Queue (ThreadPool, callbacks, state machine) |
| 5.23.6 | Webhook/Notification System (HMAC, retries) |
| 5.23.7 | Request Tracing Middleware (X-Request-ID, timing) |
| 5.23.8 | Event Relay + WebSocket rewrite (push, not polling) |
| 5.23.9 | Config API Modernization (AppConfig wired into API) |
| 5.24.0 | Scan Event Bridge (live scanner events → WebSocket) |
| 5.25.0 | Module Dependency Resolution (registry → scanner wiring) |
| 5.26.0 | Database Repository Pattern (Scan/Event/Config repos) |
| 5.27.0 | API Rate Limiting Middleware (per-tier, per-client) |
| 5.28.0 | API Pagination Helpers (PaginationParams, PaginatedResponse) |
| 5.29.0 | Correlation Service Wiring (CorrelationService → router) |
| 5.30.0 | Scan Service Facade (ScanStateMachine + ScanRepository → router) |
| 5.31.0 | Visualization Service Facade (graph/summary/timeline/heatmap → router) |
| 5.32.0 | Scan Service Phase 2 (all 25 endpoints → ScanService, zero raw DB) |
| 5.33.0 | Final Router DB Purge + Dead Code Removal |
| 5.34.0 | WebUI DB Access Centralisation (DbProvider mixin) |
| 5.35.0 | Fix silent error swallowing in service_integration.py |
| 5.36.0 | Add gRPC dependencies to requirements.txt |
| 5.37.0 | Generate gRPC stubs, wire grpc_service.py |
| 5.38.0 | Unified scan state mapping (scan_state_map.py) |
| 5.39.0 | Replace monkey-patching with functools.wraps |
| 5.40.0 | Framework-agnostic security + deprecate Flask |
| 5.41.0 | Migrate ScanService events to EventRepository |
| 5.42.0 | Domain sub-packages for code organization |
| 5.43.0 | HTTP DataService client (REST backend) |
| 5.43.1 | DataService health check endpoints |
| 5.44.0 | gRPC DataService client (Protobuf backend) |
| 5.44.1 | Circuit breaker for remote DataService |
| 5.45.0 | Extract ScanMetadataService |
| 5.46.0 | WebUI API proxy layer |
| 5.47.0 | Per-service Docker network isolation |
| 5.48.0 | API versioning with /api/v1/ prefix |
| 5.49.0 | Pydantic v2 schemas for service boundaries |
| 5.50.0 | Module interface contracts (Protocol + validation) |
| 5.51.0 | ConfigService microservice enhancements |
| 5.52.0 | Proto schema expansion (15 new RPCs + CorrelationService) |
| 5.53.0 | Service startup sequencer |
| 5.54.0 | Graceful shutdown coordination |
| 5.54.1 | Wire startup/shutdown into entry points |
| 5.55.0 | Wire Pydantic response_model on scan router |
| 5.56.0 | Structured API error responses (ErrorResponse envelope) |
| 5.56.1 | Rich OpenAPI metadata (tags, license, description) |
| 5.57.0 | Config source tracing + environment API |
| 5.58.0 | Scan lifecycle event hooks (EventBus integration) |
| 5.59.0 | Module execution timeout guard |
| 5.60.0 | Inter-service authentication (static + HMAC tokens) |
| 5.60.1 | Wire service auth into HTTP clients + docker-compose |
| 5.61.0 | API request audit logging middleware |
| 5.62.0 | Module output validation (producedEvents enforcement) |
| 5.62.1 | Documentation update for Cycles 55-64 |
| 5.63.0 | Unified scan export API (STIX/SARIF/JSON/CSV) |
| 5.63.1 | Wire pagination into workspace + data routers |
| 5.64.0 | Health check deep probes (4 new subsystem checks) |
| 5.64.1 | Comprehensive live config validation endpoint |
| 5.65.0 | Correlation results export API (CSV/JSON) |
| 5.65.1 | Workspace response schemas + response_model |
| 5.66.0 | API key rotation endpoint |
| 5.67.0 | Scan comparison endpoint |
| 5.67.1 | Documentation update for Cycles 65-74 |
| 5.68.0 | Body size limiter middleware |
| 5.68.1 | CORS middleware |
| 5.69.0 | Module runtime statistics endpoint |
| 5.70.0 | Scan tag/label management |
| 5.71.0 | Bulk scan operations |
| 5.72.0 | Per-endpoint rate limit configuration |
| 5.73.0 | Webhook event filtering + discovery |
| 5.74.0 | Module dependency graph endpoint |
| 5.74.1 | Documentation update for Cycles 75-83 |
| 5.75.0 | Recurring scan schedule API (interval/one-shot) |
| 5.75.1 | Response schemas wiring (config + data routers) |
| 5.76.0 | Request ID propagation (HTTP/gRPC/webhooks) |
| 5.77.0 | Scan timeline endpoint (chronological events) |
| 5.78.0 | Module enable/disable API (runtime management) |
| 5.79.0 | Scan search/filter API (faceted results) |
| 5.80.0 | Graceful shutdown manager (signals + FastAPI lifespan) |
| 5.80.1 | Documentation update for Cycles 84-91 |
| 5.81.0 | Streaming JSONL export for large scans |
| 5.82.0 | Per-event annotations API |
| 5.83.0 | API key scoping (predefined permission sets) |
| 5.84.0 | Config change history + diff-against-defaults |
| 5.85.0 | Event deduplication detection endpoint |
| 5.86.0 | Per-module config validation |
| 5.87.0 | Scan retry for failed/aborted scans |
| 5.88.0 | Response compression middleware (gzip) |
| 5.88.1 | Final documentation update: Cycle 100 |

| Version | Feature |
|---|---|
| 6.0.0-b40 | Docker Compose modularization via include directive |
| 6.0.0-b41 | OpenAPI TypeScript SDK generation (@hey-api/openapi-ts) |
| 6.0.0-b42 | JSONL streaming export + SSE live event stream |
| 6.0.0-b43 | Native async I/O engine (aiohttp + aiodns) |
| 6.0.0-b44 | AI structured outputs (Pydantic schemas + json_schema mode) |
| 6.0.0-b45 | PEP 561 strict typing + mypy configuration |
| 6.0.0-b46 | Documentation overhaul |
JWT, API key, and Basic authentication with role-based access control (ADMIN, ANALYST, VIEWER, API roles). Pluggable into any ASGI/WSGI app.
Multi-format scan result export: JSON, CSV, STIX 2.1 bundles, and SARIF for integration with CI/CD security tooling.
Real-time scan event streaming over WebSocket with channel-based subscriptions per scan, module, or event type.
Multi-channel alerting with wildcard topic subscriptions, supporting Slack webhooks, generic webhooks, SMTP email, and log output.
Secure API key and credential storage with four backends: in-memory, environment variables, plain JSON file, and encrypted file (PBKDF2 + XOR). Includes rotation tracking, access auditing, and config injection.
Centralised error capture with fingerprint-based deduplication, automatic classification (network/auth/parse/timeout/etc.), sliding-window rate tracking, and configurable alert thresholds with callbacks.
Bounded priority queue (HIGH/NORMAL/LOW) with backpressure support. Three overflow strategies (BLOCK/REJECT/DROP_OLDEST), batch dequeue, retry tracking with dead-letter queue.
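
The overflow behaviour described above can be sketched in a few lines. This is a minimal illustration and not the SpiderFoot implementation: the class and method names (`BoundedPriorityQueue`, `put`, `get_batch`) are hypothetical, and the BLOCK strategy, retry tracking, and dead-letter queue are left out to keep the example single-threaded.

```python
import heapq
import itertools
from enum import Enum, IntEnum


class Priority(IntEnum):
    HIGH = 0
    NORMAL = 1
    LOW = 2


class Overflow(Enum):
    REJECT = "reject"
    DROP_OLDEST = "drop_oldest"


class BoundedPriorityQueue:
    """Hypothetical sketch of a bounded priority queue with two overflow strategies."""

    def __init__(self, maxsize: int, overflow: Overflow = Overflow.REJECT) -> None:
        self.maxsize = maxsize
        self.overflow = overflow
        self._heap: list = []  # entries are (priority, sequence, item)
        self._seq = itertools.count()  # keeps FIFO order within one priority

    def put(self, item: object, priority: Priority = Priority.NORMAL) -> bool:
        if len(self._heap) >= self.maxsize:
            if self.overflow is Overflow.REJECT:
                return False  # caller sees backpressure and can retry later
            # DROP_OLDEST: evict the entry with the smallest sequence number
            oldest = min(self._heap, key=lambda entry: entry[1])
            self._heap.remove(oldest)
            heapq.heapify(self._heap)
        heapq.heappush(self._heap, (int(priority), next(self._seq), item))
        return True

    def get_batch(self, n: int) -> list:
        """Dequeue up to n items, highest priority first."""
        out = []
        while self._heap and len(out) < n:
            out.append(heapq.heappop(self._heap)[2])
        return out
```

With REJECT, `put()` returning `False` is the backpressure signal. With DROP_OLDEST the queue always accepts the new item and discards the entry that has waited longest.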
Runtime dependency resolution for modules. Given desired output event types, walks backwards through the event dependency chain to compute the minimal module set and topological load order.
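
As a rough illustration of the backward walk, the sketch below computes the minimal module set for a desired output type. The module names and the consumes/produces table are invented for the example, and `minimal_module_set` is a hypothetical helper rather than the resolver's real API.

```python
from collections import deque

# Hypothetical module metadata: name -> (consumed event types, produced event types)
MODULES = {
    "mod_dns": ({"DOMAIN_NAME"}, {"IP_ADDRESS"}),
    "mod_ports": ({"IP_ADDRESS"}, {"TCP_PORT_OPEN"}),
    "mod_whois": ({"DOMAIN_NAME"}, {"DOMAIN_WHOIS"}),
}


def minimal_module_set(desired: set, root_types: set) -> set:
    """Walk backwards from desired event types to the modules needed to produce them.

    root_types are event types supplied by the scan target itself, so the
    walk stops when it reaches one of them.
    """
    needed = set()
    seen_types = set()
    pending = deque(desired)
    while pending:
        event_type = pending.popleft()
        if event_type in seen_types or event_type in root_types:
            continue
        seen_types.add(event_type)
        for name, (consumes, produces) in MODULES.items():
            if event_type in produces and name not in needed:
                needed.add(name)
                pending.extend(consumes)  # these inputs must be produced too
    return needed
```

Asking for `TCP_PORT_OPEN` from a `DOMAIN_NAME` target selects `mod_ports` and `mod_dns` and leaves `mod_whois` unloaded, because nothing in the chain needs its output.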
Version-controlled schema evolution with numbered migration files, upgrade/downgrade functions, dry-run mode, and checksum validation. Supports PostgreSQL dialect.
Registry-driven module loading adapter that replaces the scanner's legacy `__import__` loop with `ModuleRegistry` for discovery/instantiation and `ModuleGraph` for topological dependency ordering. Features:
- Registry-first loading with automatic legacy fallback
- Topological execution order via Kahn's algorithm (replaces `_prioritysort`)
- Minimal-set pruning: when desired output types are specified, only modules in the dependency chain are loaded
- Cycle detection with warnings (cycles don't break execution)
- `LoadResult` dataclass with detailed statistics: loaded/failed/skipped counts, ordering method, pruning info, and timing
- Global singleton with thread-safe `init_module_loader()`/`get_module_loader()`/`reset_module_loader()`
- Wired into scanner via `service_integration._wire_module_loader()`
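
For readers unfamiliar with Kahn's algorithm, here is a minimal sketch of topological ordering with cycle reporting. It is not the `ModuleGraph` code; `kahn_order` and its input shape (a dict mapping each module to the modules it depends on, with every dependency also present as a key) are assumptions made for the example.

```python
from collections import deque


def kahn_order(deps: dict) -> tuple:
    """Return (ordered, cyclic): a topological order plus nodes stuck behind a cycle.

    deps maps each module name to the set of modules it depends on.
    Every dependency must itself appear as a key in deps.
    """
    indegree = {node: len(parents) for node, parents in deps.items()}
    children = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            children[parent].append(node)

    # Start with modules that depend on nothing; sorting keeps output deterministic.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    ordered = []
    while ready:
        node = ready.popleft()
        ordered.append(node)
        for child in sorted(children[node]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Nodes that never reach in-degree 0 are in, or downstream of, a cycle.
    cyclic = sorted(n for n in deps if n not in ordered)
    return ordered, cyclic
```

A loader that wants cycles to produce a warning and not a failure could log `cyclic` and append those modules after `ordered`, which is one way to match the "cycles don't break execution" behaviour.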
Clean abstraction over SpiderFootDb using the Repository Pattern.
Replaces 20+ direct SpiderFootDb(config) instantiations across API
routers with injectable, testable repository instances.
- `AbstractRepository`: Base class with context-manager lifecycle, `dbh` property, `is_connected`, `close()`, `__enter__`/`__exit__`
- `ScanRepository`: Scan CRUD via `create_scan()`, `get_scan()`, `list_scans()`, `update_status()`, `delete_scan()`, plus config, logs, errors. Includes `ScanRecord` dataclass with `from_row()`/`to_dict()`
- `EventRepository`: Event/result operations via `store_event()`, `get_results()`, `get_unique_results()`, `get_result_summary()`, `search()`, plus element sources/children (direct + recursive), false-positive management, batch log events
- `ConfigRepository`: Global config via `set_config()`, `get_config()`, `clear_config()`
- `RepositoryFactory`: Creates repos with shared or per-request DB handles. Thread-safe singleton via `init_repository_factory()`/`get_repository_factory()`/`reset_repository_factory()`
- FastAPI `Depends` providers: `get_scan_repository()`, `get_event_repository()`, `get_config_repository()` in `api/dependencies.py` with automatic lifecycle management
- Wired into scanner via `service_integration._wire_repository_factory()`
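
The context-manager lifecycle can be pictured with the sketch below. It reuses the class names from the list above, but the record fields, the row layout, and the `FakeDbHandle` stand-in are assumptions for illustration. The real repositories wrap `SpiderFootDb`.

```python
from dataclasses import asdict, dataclass


@dataclass
class ScanRecord:
    scan_id: str
    name: str
    status: str

    @classmethod
    def from_row(cls, row: tuple) -> "ScanRecord":
        # Column order is assumed for this sketch.
        return cls(scan_id=row[0], name=row[1], status=row[2])

    def to_dict(self) -> dict:
        return asdict(self)


class FakeDbHandle:
    """Stand-in for a real database handle (assumption for this sketch)."""

    def __init__(self) -> None:
        self.rows = {"s1": ("s1", "example scan", "RUNNING")}
        self.closed = False

    def close(self) -> None:
        self.closed = True


class AbstractRepository:
    def __init__(self, dbh) -> None:
        self._dbh = dbh

    @property
    def dbh(self):
        return self._dbh

    @property
    def is_connected(self) -> bool:
        return self._dbh is not None and not self._dbh.closed

    def close(self) -> None:
        if self._dbh is not None:
            self._dbh.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()  # handle is released even if the body raised


class ScanRepository(AbstractRepository):
    def get_scan(self, scan_id: str):
        row = self.dbh.rows.get(scan_id)
        return ScanRecord.from_row(row) if row else None
```

Because the handle is injected, a test can pass a fake like the one above instead of patching every `SpiderFootDb(config)` call site.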
Starlette/FastAPI middleware bridging the existing RateLimiterService
into the API layer. Every incoming request is checked against per-tier
rate limits before reaching the router.
- Per-client identity extraction from API key, `X-Forwarded-For`, or direct IP
- Route-tier mapping: `/api/scans` → scan tier, `/api/data` → data tier, etc.
- 429 Too Many Requests with `Retry-After` header on limit exceeded
- `X-RateLimit-*` response headers (Limit, Remaining, Reset) on every response
- Exempt paths for health checks, docs, OpenAPI spec
- Per-client buckets: different API keys/IPs have independent limits
- `RateLimitStats` with tier-level and top-offender tracking
- `install_rate_limiting(app, config)` wiring function
- Installed in `api/main.py` after request tracing middleware
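
The per-client check and the response headers can be illustrated without any web framework. The sketch below uses a fixed-window counter, which is an assumption: the text does not say which algorithm `RateLimiterService` uses, and `FixedWindowLimiter` is a hypothetical name.

```python
import math
import time


class FixedWindowLimiter:
    """Per-client fixed-window counter; a simplified stand-in, not RateLimiterService."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._buckets = {}  # client id -> (window start, request count)

    def check(self, client_id: str) -> tuple:
        """Return (HTTP status, rate-limit headers) for one incoming request."""
        now = self.clock()
        start, count = self._buckets.get(client_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired: start a fresh one
        reset_in = math.ceil(start + self.window - now)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Reset": str(reset_in),
        }
        if count >= self.limit:
            self._buckets[client_id] = (start, count)
            headers["X-RateLimit-Remaining"] = "0"
            headers["Retry-After"] = str(reset_in)
            return 429, headers
        count += 1
        self._buckets[client_id] = (start, count)
        headers["X-RateLimit-Remaining"] = str(self.limit - count)
        return 200, headers
```

A middleware would derive `client_id` from the API key or client IP, skip exempt paths, and copy the returned headers onto the response.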
Standardized pagination across all API list endpoints with consistent request parameters and response envelopes.
- `PaginationParams`: FastAPI `Depends()`-compatible query extractor supporting page-based (`page`/`page_size`) and offset-based (`offset`/`limit`) modes with automatic mapping between them
- `PaginatedResponse`: Standardized envelope with `items`, `total`, `page`, `page_size`, `pages`, `has_next`, `has_previous`
- `paginate()`: In-memory slicing with optional sort support
- `paginate_query()`: For pre-sliced DB results with total count
- Sort helpers: `dict_sort_key()`, `attr_sort_key()` for common patterns
- RFC 8288 Link headers: `generate_link_header()` for `next`/`prev`/`first`/`last` navigation
- `make_params()`: Convenience constructor for programmatic/test use
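
A minimal sketch of the envelope and the Link header follows. The function names come from the list above, but the signatures, defaults, and query-string format are assumptions. The real helpers may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class PaginatedResponse:
    items: list
    total: int
    page: int
    page_size: int
    pages: int
    has_next: bool
    has_previous: bool


def paginate(items: list, page: int = 1, page_size: int = 50, sort_key=None) -> PaginatedResponse:
    """Slice an in-memory list into one page (1-based page numbers assumed)."""
    if sort_key is not None:
        items = sorted(items, key=sort_key)
    total = len(items)
    pages = max(1, math.ceil(total / page_size))
    start = (page - 1) * page_size
    return PaginatedResponse(
        items=items[start:start + page_size],
        total=total, page=page, page_size=page_size, pages=pages,
        has_next=page < pages, has_previous=page > 1,
    )


def generate_link_header(base_url: str, page: int, pages: int, page_size: int) -> str:
    """Build an RFC 8288 Link header value with first/prev/next/last relations."""
    def link(p: int, rel: str) -> str:
        return f'<{base_url}?page={p}&page_size={page_size}>; rel="{rel}"'

    parts = [link(1, "first")]
    if page > 1:
        parts.append(link(page - 1, "prev"))
    if page < pages:
        parts.append(link(page + 1, "next"))
    parts.append(link(pages, "last"))
    return ", ".join(parts)
```

Offset-based requests map onto the same code with `page = offset // limit + 1` and `page_size = limit`, provided the offset is a multiple of the limit.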
Rewritten to delegate to CorrelationService instead of raw
SpiderFootDb / config manipulation. All 7 endpoints now use the
service layer and Cycle 25 pagination.
- Rule CRUD: `add_rule()`, `get_rule()`, `update_rule()`, `delete_rule()`, `filter_rules()` added to `CorrelationService`
- `get_correlation_svc`: FastAPI `Depends()` provider in `dependencies.py` returning the singleton service
- Real execution: Test/analyze endpoints call `svc.run_for_scan()` with actual timing instead of hardcoded results
- Pagination: List and detailed endpoints use `PaginationParams` + `paginate()` for standardized response envelopes
- No direct DB access: All `SpiderFootDb(config.get_config())` and `json.dumps()`/`configSet()` calls eliminated from the router
Unified scan lifecycle management combining ScanRepository (Cycle 23)
with ScanStateMachine for formal state-transition enforcement.
- `ScanService`: High-level facade wrapping repository + state machine with CRUD methods `list_scans`, `get_scan`, `create_scan`, `delete_scan`, `delete_scan_full`, `stop_scan`, `get_scan_state`
- State Machine Integration: `stop_scan()` validates transitions (RUNNING→STOPPING, CREATED→CANCELLED) before persisting; returns HTTP 409 Conflict when transition is invalid (e.g. stopping a completed scan)
- `get_scan_service`: FastAPI `Depends()` generator provider in `dependencies.py` with automatic lifecycle management
- Pagination: `list_scans` endpoint uses `PaginationParams` + `paginate()` for standardized response envelopes
- Typed records: Endpoints return `ScanRecord.to_dict()` instead of raw tuple-index dicts (`scan[0]`, `scan[6]`, etc.)
- Gradual migration: Service exposes `.dbh` for endpoints not yet migrated (export, viz); full migration planned for future cycles
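
The transition check behind `stop_scan()` might look roughly like this. Only the two stop-related transitions (RUNNING→STOPPING, CREATED→CANCELLED) come from the text. The other table entries, the `InvalidTransition` exception, and the function signature are assumptions for illustration.

```python
from enum import Enum


class ScanState(Enum):
    CREATED = "CREATED"
    RUNNING = "RUNNING"
    STOPPING = "STOPPING"
    CANCELLED = "CANCELLED"
    COMPLETED = "COMPLETED"


# RUNNING->STOPPING and CREATED->CANCELLED are named in the text above;
# every other entry in this table is an assumption.
ALLOWED = {
    ScanState.CREATED: {ScanState.RUNNING, ScanState.CANCELLED},
    ScanState.RUNNING: {ScanState.STOPPING, ScanState.COMPLETED},
    ScanState.STOPPING: {ScanState.CANCELLED},
    ScanState.CANCELLED: set(),
    ScanState.COMPLETED: set(),
}


class InvalidTransition(Exception):
    """A router could translate this into HTTP 409 Conflict."""


def stop_scan(current: ScanState) -> ScanState:
    """Validate the stop transition and return the state to persist."""
    target = ScanState.CANCELLED if current is ScanState.CREATED else ScanState.STOPPING
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"cannot go from {current.value} to {target.value}")
    return target
```

Validating before persisting means an invalid request, such as stopping a completed scan, never touches the database.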
Centralises all SpiderFootDb instantiation across the CherryPy
WebUI into a single overridable _get_dbh() method.
- 78 per-request `SpiderFootDb(self.config)` calls replaced across `scan.py` (62), `export.py` (7), `settings.py` (4), `helpers.py` (1), `info.py` (1), `routes.py` (3)
- `DbProvider` mixin added to `WebUiRoutes` MRO: all endpoint classes inherit `_get_dbh(config=None)` via diamond inheritance
- Single override point: tests or future service migration can replace `_get_dbh()` instead of patching 78+ instantiation sites
- Config override: `_get_dbh(cfg)` for cases needing `deepcopy` config (e.g. `rerunscan`, `rerunscanmulti`)
- CHANGELOG: `event_schema.py` entry annotated as deleted in v5.33.0
Completes the ScanService facade migration started in Cycle 27,
eliminating all raw SpiderFootDb instantiation from the scan router.
- 15 new ScanService methods: `get_events()`, `search_events()`, `get_correlations()`, `get_scan_logs()`, `get_metadata()`/`set_metadata()`, `get_notes()`/`set_notes()`, `archive()`/`unarchive()`, `clear_results()`, `set_false_positive()`, `get_scan_options()`
- All 25 scan endpoints now delegate to `ScanService` via `Depends(get_scan_service)`; zero `SpiderFootDb` imports remain
- Export endpoints: CSV/XLSX event export, multi-scan JSON export, search export, logs export, correlations export
- Lifecycle endpoints: create, rerun, rerun-multi, clone
- Results management: false-positive with parent/child validation, clear results
- Metadata/notes/archive: CRUD with `hasattr` guards for optional DB methods
- Static route ordering: `/scans/export-multi`, `/scans/viz-multi`, `/scans/rerun-multi` registered before `/{scan_id}` routes
Removes the last 3 raw SpiderFootDb instantiations from all API
routers, achieving a clean architectural boundary between the router
layer and the database.
- `config.py`: `GET /event-types` now uses `ConfigRepository` via `Depends(get_config_repository)`. New `ConfigRepository.get_event_types()` method wraps `dbh.eventTypes()`.
- `reports.py`: `_get_scan_events()` helper rewritten to accept an injected `ScanService`. Endpoints `POST /reports/generate` and `POST /reports/preview` pass their `Depends(get_scan_service)` instance.
- `websocket.py`: `_polling_mode()` rewritten to build a `ScanService` from `RepositoryFactory` instead of raw `SpiderFootDb`. Proper cleanup via `svc.close()` in `finally` block.
- Dead code removal: `event_schema.py` (655 lines) and its test file deleted (zero production imports).
- Verification: `grep -r "SpiderFootDb" spiderfoot/api/routers/` returns only docstring/comment references, zero actual imports.
Dedicated service layer for scan data visualization, removing raw
SpiderFootDb from all 5 visualization router endpoints.
- `VisualizationService`: Composes `ScanRepository` + raw `dbh` with methods `get_graph_data`, `get_multi_scan_graph_data`, `get_summary_data`, `get_timeline_data`, `get_heatmap_data`
- Smart scan validation: `_require_scan()` uses `ScanRepository` first, with raw `dbh.scanInstanceGet()` fallback
- Timeline aggregation: Handles both `datetime` and epoch timestamps; supports hour/day/week bucketing
- Heatmap matrix: Builds x/y matrix from result dimensions (module/type/risk) with configurable axes
- Multi-scan graph: Merges results across scan IDs, skipping invalid scans with warning logs
- `get_visualization_service`: FastAPI `Depends()` generator provider in `dependencies.py` with automatic lifecycle management
- Static route ordering: `/visualization/graph/multi` registered before `/{scan_id}` to avoid path parameter capture
Central fan-out hub bridging the EventBus to WebSocket/SSE consumers.
Per-scan consumer queues with bounded overflow (drop-oldest policy),
EventBus subscription management, and lifecycle helpers for
scan_started / scan_completed / status_update events.
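
The per-scan fan-out with a drop-oldest policy can be sketched with `collections.deque`. The `EventRelay` class below is a simplified, hypothetical version: method names and the `dropped` counter are assumptions, and EventBus subscription handling is omitted.

```python
from collections import defaultdict, deque


class EventRelay:
    """Hypothetical fan-out hub: one bounded queue per (scan, consumer)."""

    def __init__(self, max_queue: int = 1000) -> None:
        self.max_queue = max_queue
        self._queues = defaultdict(dict)  # scan id -> {consumer id: deque}
        self.dropped = 0

    def register(self, scan_id: str, consumer_id: str) -> deque:
        queue = deque(maxlen=self.max_queue)
        self._queues[scan_id][consumer_id] = queue
        return queue

    def push(self, scan_id: str, event: dict) -> None:
        """Deliver one event to every consumer registered for the scan."""
        for queue in self._queues.get(scan_id, {}).values():
            if len(queue) == queue.maxlen:
                self.dropped += 1  # deque(maxlen=...) evicts the oldest entry on append
            queue.append(event)
```

Dropping the oldest event keeps a slow WebSocket consumer from stalling the scanner, at the cost of that consumer missing early events.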
Lightweight synchronous adapter that sits in the scanner's
waitForThreads() dispatch loop. Forwards each SpiderFootEvent
to the EventRelay for real-time WebSocket delivery. Features:
configurable per-event-type throttling, large-data truncation,
per-scan statistics, and a bridge registry for lifecycle management.
Starlette middleware that generates/echoes X-Request-ID headers,
sets contextvars for request context, and logs request start/end
with timing. Warns on slow requests exceeding a configurable threshold.
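
The core of that middleware, stripped of Starlette specifics, is sketched below. `handle_request`, the logger name, and the 32-character hex ID format are assumptions. Only the `X-Request-ID` header and the use of `contextvars` come from the description.

```python
import contextvars
import logging
import time
import uuid

log = logging.getLogger("request_tracing")  # hypothetical logger name
request_id_var = contextvars.ContextVar("request_id", default="-")


def handle_request(headers: dict, handler, slow_threshold: float = 1.0):
    """Echo or generate X-Request-ID, expose it via a contextvar, and time the handler."""
    request_id = headers.get("X-Request-ID") or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    started = time.perf_counter()
    try:
        body = handler()
    finally:
        elapsed = time.perf_counter() - started
        request_id_var.reset(token)  # do not leak the ID into the next request
    if elapsed > slow_threshold:
        log.warning("slow request %s took %.3fs", request_id, elapsed)
    return body, {"X-Request-ID": request_id}
```

Any code running inside `handler` can call `request_id_var.get()` to tag its log lines without the ID being passed through every function signature.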
Outbound HTTP notification delivery with HMAC-SHA256 signing
(X-SpiderFoot-Signature), exponential backoff retries, delivery
history (bounded deque), and stats. Uses httpx with urllib fallback.
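
Signing and retry scheduling can be shown with the standard library alone. This sketch assumes the signature is the hex HMAC-SHA256 digest of the raw body. The exact header value format, the retry base, and the cap are not stated in the text and are assumptions here.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """Receiver-side check using a constant-time comparison."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


# Example payload and header (event fields are illustrative).
body = json.dumps({"event": "scan_completed", "scan_id": "abc"}).encode()
headers = {"X-SpiderFoot-Signature": sign_payload("s3cret", body)}
```

The receiver must verify against the exact bytes it received. Re-serializing the parsed JSON can change key order or whitespace and invalidate the signature.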
Webhook CRUD operations, event routing to matching/enabled webhooks, fire-and-forget async delivery, webhook testing, and integration with the Task Queue and Alert Engine for automated notifications.
ThreadPoolExecutor-backed task execution with TaskRecord state
machine (PENDING → RUNNING → COMPLETED/FAILED/CANCELLED), progress
tracking, completion callbacks, and a singleton task manager.
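
A reduced version of this pattern is sketched below. `TaskManager`, its method names, and the record fields are hypothetical. Cancellation, progress tracking, and the singleton accessor are omitted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from enum import Enum
from typing import Any


class TaskState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class TaskRecord:
    task_id: int
    state: TaskState = TaskState.PENDING
    result: Any = None
    error: str = ""


class TaskManager:
    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._next_id = 0
        self.tasks = {}  # task id -> TaskRecord

    def submit(self, fn, on_done=None) -> TaskRecord:
        with self._lock:
            self._next_id += 1
            record = TaskRecord(task_id=self._next_id)
            self.tasks[record.task_id] = record

        def run() -> None:
            record.state = TaskState.RUNNING
            try:
                record.result = fn()
                record.state = TaskState.COMPLETED
            except Exception as exc:
                record.error = str(exc)
                record.state = TaskState.FAILED
            if on_done is not None:
                on_done(record)  # completion callback runs on the worker thread

        self._pool.submit(run)
        return record

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

Callers get the `TaskRecord` back immediately and can poll `state`, or pass `on_done` to be notified when the task finishes.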
11-section typed dataclass configuration replacing the legacy flat
dict. Sections: Core, Network, Database, Web, API, Cache, EventBus,
Vector, Worker, Redis, Elasticsearch. Features: from_dict() /
to_dict() round-trip, apply_env_overrides() for SF_* variables,
20+ validation rules, and merge semantics for layered overrides.
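
The round-trip and environment override behaviour can be illustrated with two sections. The field names, defaults, and the `SF_<SECTION>_<FIELD>` variable naming are assumptions: the text only says that `SF_*` variables are applied.

```python
import os
from dataclasses import asdict, dataclass, field, fields


@dataclass
class NetworkConfig:
    timeout: int = 30
    proxy: str = ""


@dataclass
class RedisConfig:
    url: str = "redis://localhost:6379/0"


@dataclass
class AppConfig:
    network: NetworkConfig = field(default_factory=NetworkConfig)
    redis: RedisConfig = field(default_factory=RedisConfig)

    @classmethod
    def from_dict(cls, data: dict) -> "AppConfig":
        return cls(
            network=NetworkConfig(**data.get("network", {})),
            redis=RedisConfig(**data.get("redis", {})),
        )

    def to_dict(self) -> dict:
        return asdict(self)

    def apply_env_overrides(self, environ=os.environ) -> None:
        """Apply SF_<SECTION>_<FIELD> variables, coercing to the field's current type.

        Works for int and str fields; bool fields would need explicit parsing.
        """
        for section in fields(self):
            section_obj = getattr(self, section.name)
            for f in fields(section_obj):
                key = f"SF_{section.name}_{f.name}".upper()
                if key in environ:
                    current = getattr(section_obj, f.name)
                    setattr(section_obj, f.name, type(current)(environ[key]))
```

Layering then becomes: build defaults, merge a file or database dict through `from_dict()`, and apply environment overrides last so deployment settings win.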
High-performance HTTP/DNS client layer built on aiohttp + aiodns:
- Session pool: `aiohttp.ClientSession` instances cached per `(module_name, event_loop)` tuple, which avoids recreating connections across calls
- `async_fetch_url()`: Returns dict matching the sync `fetchUrl()` shape (`code`, `status`, `content`, `headers`, `realurl`)
- DNS resolver: `aiodns.DNSResolver` for A/AAAA/PTR lookups with `getaddrinfo()` fallback
- Wildcard detection: `async_check_dns_wildcard()` for brute-force flood mitigation
- Plugin integration: `SpiderFootAsyncPlugin` subclasses call `async_fetch_url()`/`async_resolve_host()` directly, with no `run_in_executor` wrapping
See Async Plugin Guide for module development.
Pydantic-validated structured output pipeline for LLM responses:
- 12 response models: `ScanReportOutput`, `ExecutiveSummaryOutput`, `RiskAssessmentOutput`, `ThreatAssessmentOutput`, `CorrelationOutput`, `FindingValidationOutput`, `TextSummaryOutput`, plus nested types (`Finding`, `ThreatIndicator`, `Recommendation`, `Attribution`)
- `chat_structured()` in `LLMClient`: Builds OpenAI `response_format: json_schema` payload from `model.model_json_schema()`, injects format hint into system message, validates response via `model_validate()`
- `call_llm_structured()` in `AgentBase`: Same pipeline available to AI agents
- Mock support: `_MockGenerator._mock_from_schema()` produces schema-conformant test data
See AI Structured Outputs Guide for usage.
Auto-generated, type-safe API client for the React frontend:
- Generator: `@hey-api/openapi-ts` v0.92.4 with native fetch client (not Axios)
- Post-processing: `dump_openapi.py` normalizes `operationId` values for clean SDK method names
- JWT interceptor: `client.interceptors.request.use()` attaches `Authorization: Bearer <token>` to every request
- Regeneration: `npm run generate:api` re-generates from the live `/api/openapi.json`
See TypeScript SDK Guide for the generation workflow.
Pipeline-friendly scan result export:
- Endpoint: `GET /api/scans/{id}/export/jsonl` streams newline-delimited JSON
- SSE stream: `GET /events/stream` delivers Server-Sent Events for real-time scan event delivery
- Event enrichment: `SpiderFootEvent.asDict()` produces full serializable event dicts
- Use cases: `jq` pipelines, Elasticsearch bulk ingest, SIEM integration, real-time dashboards
See Streaming Export Guide for integration examples.
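
On the wire, both formats are simple line-oriented framings of the same event dicts. The generators below are a sketch of that framing only. The function names and event fields are hypothetical, and the real endpoints stream from the database rather than from an in-memory list.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(events: Iterable[dict]) -> Iterator[str]:
    """Yield one compact JSON document per line without buffering the whole scan."""
    for event in events:
        yield json.dumps(event, separators=(",", ":")) + "\n"


def iter_sse(events: Iterable[dict]) -> Iterator[str]:
    """Frame the same events as Server-Sent Events: a data line, then a blank line."""
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"


# Illustrative events; field names are assumptions.
events = [
    {"type": "IP_ADDRESS", "data": "192.0.2.1"},
    {"type": "DOMAIN_NAME", "data": "example.com"},
]
lines = list(iter_jsonl(events))
```

Because each line is a complete JSON document, a consumer can process results as they arrive and resume after a dropped connection without re-parsing earlier output.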
Static type safety enforcement across the Python codebase:
- `py.typed` marker in `spiderfoot/`: enables downstream type checking for packages that import SpiderFoot
- mypy config in `setup.cfg`: `python_version=3.11`, `check_untyped_defs=true`, `no_implicit_optional=true`
- Coverage: DB layer (14 return types + 17 params fixed), core models (`__all__` exports), network utilities (implicit Optional → explicit union types)
- Convention: Use `X | None` (PEP 604) instead of `Optional[X]` for all new code