On Error Handling And Resilience - robbiemu/aclarai GitHub Wiki
🛡️ Error Handling and Resilience Strategy
This document defines the error handling and resilience strategy used across distributed components in the aclarai system. It outlines guiding principles and shared mechanisms that help the system remain robust, fault-tolerant, and transparent when facing unexpected failures.
🎯 Purpose
Establish a consistent approach for anticipating, detecting, responding to, and recovering from errors—minimizing data loss, maximizing uptime, and providing actionable diagnostics.
🔧 Toolkit at a Glance
The following table summarizes the core resilience techniques used across the system:
Mechanism | When to Use | Key Components |
---|---|---|
Structured Logging | Always | All components |
Retry + Backoff | Transient API/db errors | LLM agents, Neo4j, Postgres |
Atomic File Writes | Config and Markdown updates | aclarai-core , Obsidian outputs |
Graceful Degradation | Low-quality or invalid data | Plugins, Evaluation Agents |
Circuit Breakers (planned) | Persistent service failures | External APIs (LLMs, databases) |
Idempotency & Deduplication | Reprocessing, sync jobs | vault-to-graph , scheduler |
I. Guiding Principles
-
Fail Gracefully, Not Silently: Errors should be explicitly logged and diagnosed. Non-critical failures should degrade functionality without crashing the system.
-
Protect Data Integrity: All data writes (e.g., vault, Neo4j) must be atomic and avoid corruption.
-
Separate Transient and Permanent Errors: Use retries only for transient issues (e.g., timeouts). Fail fast on permanent issues (e.g., invalid configs).
-
Isolate Failures: Errors in one part should not affect unrelated components.
-
Enable Observability: Errors must be visible through logs or status panels to allow rapid debugging.
-
Use Idempotency Where Possible: Repeated operations (e.g., retries) should not create duplicates or side effects.
II. Common Error Types
-
Transient Errors Temporary issues that usually resolve (e.g., network failures, rate limits).
-
Permanent Operational Errors Persistent issues due to misconfiguration or infrastructure (e.g., invalid API keys, disk full).
-
Data Errors Input-related problems (e.g., malformed JSON, unrecognized formats).
-
Logical Errors Bugs in the application logic (e.g., invalid state transitions).
III. Resilience Toolkit
1. Structured Logging
- Use: All services (
aclarai-core
,vault-watcher
,scheduler
,aclarai-ui
). - How: Use Python’s
logging
with structured output. - Why: Enables centralized log collection, traceability using IDs (e.g.,
claim_id
,block_id
).
2. Retries with Exponential Backoff
- Use: Transient errors from LLM APIs, Neo4j, Postgres, file I/O.
- How: Max 3 retries with exponential backoff.
- Why: Improves resilience without manual intervention.
- Examples: LLM evaluation agents, entailment evaluations.
3. Atomic File Writes
- Use: Writing Markdown, config files.
- How:
write-temp → fsync → rename
pattern. - Why: Ensures complete, consistent writes.
- Examples: Tier 1–3 creation, config updates.
4. Graceful Degradation / Filtering
- Use: Data errors or failed quality checks.
- How: Log, filter out, and optionally annotate with
null
scores or error flags. - Why: Keeps low-quality data from polluting the graph.
- Examples: Evaluation agents, duplicate detection, fallback formats.
5. Circuit Breakers (Planned)
- Use: Repeated failures from external services.
- How: Trip after failure threshold, test periodically for recovery.
- Why: Prevents cascading failures.
- Note: Considered for post-MVP.
6. Idempotency & Deduplication
- Use: Synchronization and reprocessing.
- How: Unique
aclarai:id
,ver=
, and hash checks. - Why: Safe retries and parallelism.
- Examples:
vault-to-graph
, scheduler jobs.
IV. Error Handling by Component
aclarai-core
- Focus: Data integrity and orchestration.
- Patterns: Atomic writes,
try/except
, error filtering based on evaluation results.
LLM Agents (Claimify, Evaluation, Summary)
- Focus: API interactions.
- Patterns: API retries, null scores on failure, detailed logging.
Plugins (Format Conversion)
- Focus: Parsing diverse inputs.
- Patterns: Return
None
on failure, log parsing errors, inform UI.
vault-watcher
- Focus: File system monitoring.
- Patterns: I/O resilience, batch/throttle events, fallback via full sync.
Neo4j & Postgres
- Focus: Persistent storage.
- Patterns: Retry with backoff, transaction wrapping, specific exception handling.
aclarai-scheduler
- Focus: Job orchestration.
- Patterns: Job-level retries, isolation, lifecycle logging, respect global pause flag.
V. Observability
System health and error states are surfaced through:
-
Structured Logs: Available from containers and CLI.
-
Gradio Status Panels:
- Review Panel: Job status, pause state, and error summaries.
- Import Panel: Real-time ingestion progress and summaries.
-
Markdown Metadata: Key evaluation results and error flags embedded directly in files.