Error Recovery Strategies - joehubert/ai-agent-design-patterns GitHub Wiki
Classification
Intent
To implement robust fallback mechanisms that maintain system functionality when primary approaches fail, with explicit error handling protocols that minimize disruption and preserve user experience.
Also Known As
Fault Tolerance Mechanisms, Exception Handling Patterns, Degradation Management, Resilient Processing Pathways
Motivation
AI agents frequently encounter unexpected errors during operation, including:
- Model inference failures or timeouts
- Tool execution errors or permission issues
- External API outages or rate limiting
- Invalid inputs or unexpected data formats
- Knowledge gaps or hallucinations when handling complex queries
Traditional error handling often treats failures as binary events, either succeeding or failing completely. However, in agentic systems, partial success or alternative processing pathways may still provide value to users. Unlike conventional software where errors might be fully anticipated, AI agents face unique challenges due to:
- The probabilistic nature of LLM outputs
- External dependencies with unpredictable availability
- Complex reasoning chains where errors can cascade
- Varying quality of responses depending on prompt and context
- The need to maintain conversational coherence even during recovery
Effective error recovery requires not just detecting failures but implementing strategic responses that preserve as much functionality as possible while clearly communicating limitations to users.
Applicability
This pattern should be applied when:
- Building mission-critical AI systems where failures have significant consequences
- Designing agents that rely on multiple external tools or services
- Implementing complex reasoning chains where individual steps may fail
- Creating user-facing applications where graceful error handling improves user experience
- Developing agents that must operate with varying levels of reliability in different environments
- Working with models prone to specific types of errors (hallucinations, limitations, etc.)
- Building systems that must adapt to varying quality of service from underlying models or APIs
Structure
To do...
Components
-
Error Detection System: Monitors agent operations and identifies failures through explicit error codes, timeout mechanisms, quality assessment, and validation checks.
-
Error Classification Engine: Categorizes detected errors based on severity, source, and recoverability to determine appropriate recovery strategies.
-
Recovery Strategy Repository: Maintains a collection of predefined recovery approaches for different error types, including retry logic, simplification strategies, and alternative processing pathways.
-
Strategy Selector: Chooses the most appropriate recovery strategy based on the error context, user needs, and available fallback options.
-
Execution Monitor: Manages the implementation of recovery strategies and evaluates their success, potentially attempting multiple approaches in sequence.
-
Transparency Manager: Handles communication with users about errors and recovery actions, balancing detailed explanations with simplicity.
-
Learning Mechanism: Collects data on error patterns and recovery effectiveness to improve future strategy selection and implementation.
Interactions
-
The Error Detection System continuously monitors agent operations across all processing steps.
-
When a failure occurs, the Error Classification Engine analyzes the error details to determine its type, severity, and potential recovery paths.
-
The Strategy Selector consults the Recovery Strategy Repository to identify the most appropriate recovery approach based on the error classification and contextual factors.
-
The Execution Monitor implements the selected strategy, tracking its progress and effectiveness.
-
If the initial recovery attempt fails, the system may escalate to more aggressive fallback options or combine multiple strategies.
-
Throughout the recovery process, the Transparency Manager communicates appropriate information to users about what went wrong and what steps are being taken.
-
After resolution (successful or not), the Learning Mechanism records information about the error and recovery process to improve future handling.
The interactions typically follow a progressive escalation pattern, starting with minimal interventions (like retries) before moving to more significant fallbacks (such as task simplification or human escalation).
Consequences
Benefits:
- Increases overall system reliability and uptime
- Improves user experience by avoiding abrupt failures
- Enables partial success in complex operations where some components fail
- Builds user trust through transparent communication about limitations
- Maintains conversational coherence even during recovery
- Reduces the need for human intervention in many failure scenarios
- Creates opportunities for learning and system improvement through error analysis
Limitations:
- Increases system complexity with additional recovery logic
- May increase latency when primary methods fail and fallbacks are used
- Creates potential for cascading failures if recovery strategies themselves have bugs
- Requires engineering effort to implement comprehensive error handling
- May create false confidence that all error scenarios are covered
- Can be difficult to test exhaustively due to the variety of possible failure modes
Performance implications:
- Recovery strategies typically have higher latency than successful primary paths
- Additional monitoring and error detection may add overhead to normal operations
- Memory and computation requirements increase with the sophistication of recovery options
- System throughput may decrease during recovery scenarios, potentially affecting multiple users
Implementation
-
Map the Failure Domain:
- Identify all potential failure points in your agent architecture
- Categorize errors by type, source, and severity
- Determine which errors are recoverable versus terminal
-
Design Tiered Recovery Strategies:
- For each error category, define multiple recovery approaches ordered by complexity
- Start with simple strategies (retries, parameter adjustments) before more complex fallbacks
- Define clear conditions for escalation between strategies
-
Implement Robust Monitoring:
- Add explicit validation of all external tool outputs
- Set appropriate timeouts for all operations
- Implement quality checks for LLM outputs
- Create health checks for all dependencies
-
Build Recovery Pathways:
- Implement alternative processing routes for critical functionality
- Create mechanisms to simplify complex queries that fail
- Design task decomposition approaches for partial completion
- Establish clear protocols for human escalation when necessary
-
Ensure Transparent Communication:
- Design error messages that clearly explain what went wrong
- Create informative but non-technical user notifications
- Implement progress updates during extended recovery operations
- Provide clear expectations about limitations during degraded operation
-
Establish Feedback Loops:
- Log all errors and recovery attempts with detailed context
- Create mechanisms to analyze error patterns over time
- Implement automated improvements to error handling based on past failures
- Regularly review recovery effectiveness metrics
-
Test Recovery Scenarios:
- Implement chaos testing to deliberately introduce failures
- Simulate various error conditions during development
- Create comprehensive test suites for recovery pathways
- Monitor recovery performance in production
Code Examples
To do...
Variations
Progressive Retry Pattern: A specialized implementation focusing on intelligent retry logic with exponential backoff, jitter, and adaptive parameters. Particularly useful for transient failures in external API calls or model inference.
Alternative Approach Pattern: Instead of retrying the same operation, this variation attempts completely different methods to achieve the same goal, such as using different models, tools, or reasoning approaches when the primary method fails.
Quality-Based Recovery: Rather than waiting for explicit errors, this variation proactively evaluates the quality of agent outputs against predefined criteria and initiates recovery when quality falls below acceptable thresholds.
Human-in-the-Loop Recovery: A variation that integrates human assistance into the recovery process for complex or high-stakes scenarios, with clear protocols for when and how to escalate issues to human operators.
Context Preservation Pattern: Focuses specifically on maintaining conversational context and coherence during recovery, ensuring that even when errors occur, the agent can resume natural interaction.
Real-World Examples
Customer Service Virtual Agents: When unable to understand a complex customer query, these agents use progressive recovery by first asking clarifying questions, then attempting to address parts of the query they can handle, before ultimately offering to connect with a human agent if recovery fails.
AI-Powered Search Systems: When primary semantic search fails to return relevant results, these systems fall back to keyword matching, then to broader category searches, and finally to guided navigation options rather than showing no results.
Autonomous Decision Systems: Financial trading systems implement multi-tiered recovery where algorithm failures trigger increasingly conservative fallback strategies, from retrying with adjusted parameters to applying simplified heuristics, and ultimately pausing operations when confidence is too low.
Content Moderation Platforms: When automated content analysis encounters edge cases or ambiguous content, these systems implement recovery by routing to specialized models for deeper analysis, applying more conservative filtering rules, or escalating to human review.
Related Patterns
-
Graceful Degradation: Often paired with Error Recovery to maintain functionality at reduced capability levels when full recovery isn't possible.
-
Fallback Chains: Provides the structural framework for implementing progressive recovery strategies across multiple alternate pathways.
-
Confidence-Based Human Escalation: Offers a specific recovery pathway for scenarios where automated approaches have low confidence.
-
Reflection: Enhances error recovery by enabling agents to analyze their own failures and adjust recovery approaches accordingly.
-
Hierarchical Task Decomposition: Supports partial success strategies by providing frameworks to complete portions of tasks even when others fail.