Input Filtering - joehubert/ai-agent-design-patterns GitHub Wiki

Classification

Intent

To screen and sanitize user inputs for potentially problematic content before they reach the language model, reducing the risk of harmful, manipulative, or malicious prompts that could lead to undesired outputs or behaviors.

Also Known As

Input Screening, Prompt Filtering, Content Moderation, Input Sanitization, Request Validation

Motivation

Language models can be vulnerable to various forms of prompt manipulation, including:

Prompt injections that attempt to override system instructions
Jailbreaking attempts that try to circumvent safety guardrails
Harmful content that may trigger unsafe model responses
Inputs designed to extract confidential information or cause resource exhaustion

Traditional approaches that rely solely on the LLM itself to identify and reject problematic content are insufficient because:

The model may not recognize subtle manipulation tactics
By the time the model processes the input, it has already been exposed to potentially harmful content
Malicious inputs can be crafted to bypass the model's built-in safeguards

Input Filtering provides a protective layer that analyzes and screens inputs before they reach the primary LLM, using specialized detection models, pattern matching, and other techniques specifically designed to identify problematic patterns.

Applicability

When to use this pattern:

In systems with public-facing LLM interfaces accessible to unauthenticated users
Applications handling sensitive data or providing critical services
When deployment includes high-stakes use cases (healthcare, financial, legal)
Systems that need to comply with specific content policies or regulations
Applications targeting vulnerable populations, including children
When LLM outputs could potentially trigger automated actions or workflows
For services where rapid response time is essential, and post-processing of problematic outputs is insufficient

Prerequisites for successful implementation:

Clear definition of what constitutes problematic content for your specific application
Regular updates to filtering rules as new evasion techniques emerge
Monitoring systems to track filter effectiveness and false positive rates

Structure

To do...

Components

The key elements participating in the pattern:

Input Receiver: The component that initially accepts user inputs and passes them to the filtering system before LLM processing.
Content Classifier: Specialized models trained to identify specific categories of problematic content (toxicity, hate speech, personal information, etc.).
Pattern Matcher: Rule-based systems that detect known patterns of prompt attacks, using regular expressions or similar pattern matching approaches.
Token/Embedding Analyzer: Components that examine token-level or embedding-level representations of inputs to identify anomalous patterns.
Policy Manager: Maintains the current filtering rules and thresholds, potentially customized for different user roles or contexts.
Decision Engine: Combines signals from various filtering components to make a final determination on whether to block, flag, or allow the input.
Feedback Collector: Gathers information about filter performance, including false positives and false negatives.

Interactions

How the components work together:

The Input Receiver captures the incoming user request and passes it to the Decision Engine.
The Decision Engine coordinates the filtering process, calling specialized components:
- Content Classifier analyzes the input for prohibited content categories
- Pattern Matcher checks for known attack patterns
- Token/Embedding Analyzer looks for statistical anomalies
Each component returns its assessment to the Decision Engine.
The Decision Engine consults the Policy Manager to apply the appropriate policy rules and thresholds.
Based on the combined analysis, the Decision Engine makes one of several determinations:
- Allow the input to proceed to the LLM without modification
- Block the input entirely with an appropriate message to the user
- Modify/sanitize the input before passing it to the LLM
- Flag the input for human review before processing
The Feedback Collector logs the decision and outcome for further system improvement.

Consequences

The results and trade-offs of using the pattern:

Benefits:

Reduces the risk of prompt manipulation and jailbreaking attempts
Protects the system from malicious inputs before they reach the main model
Allows for specialized detection of different types of problematic content
Can be updated and modified independently of the main LLM
Reduces liability and risk for operators of LLM-based services
Can be customized based on application context and user permissions

Limitations:

Adds latency to request processing
May produce false positives that block legitimate requests
Requires ongoing maintenance to keep up with new attack vectors
Cannot eliminate all risks, especially for novel attack patterns
May struggle with contextual nuance that would be obvious to humans

Performance implications:

Lightweight filtering (e.g., pattern matching) adds minimal overhead
More sophisticated filtering (e.g., dedicated classifier models) increases compute requirements
Batching filter operations can improve throughput at the cost of individual request latency
Caching common input patterns can improve performance for repeated or similar requests

Implementation

Guidelines for implementing the pattern:

Define clear filtering criteria: Establish what types of content should be blocked, flagged, or allowed based on your application's requirements and risk assessment.
Start with layered defense:
- Begin with simple pattern matching for known attack vectors
- Add specialized classifiers for categories relevant to your use case
- Implement statistical anomaly detection for novel attacks
Balance sensitivity with usability:
- Too strict filtering creates frustrating false positives
- Too permissive filtering leaves vulnerabilities
- Consider user-specific thresholds based on trust levels or use cases
Implement graceful rejection:
- Provide helpful feedback when content is filtered
- Avoid revealing specific filter criteria that could help bypass the system
- Consider offering alternatives or suggestions when possible
Monitor and improve:
- Log filtering decisions and outcomes
- Regularly review false positives and false negatives
- Update filter rules based on emerging threats and user feedback
Consider human-in-the-loop for edge cases:
- Route borderline cases to human moderators
- Use human decisions to improve automated filtering

Common pitfalls to avoid:

Relying solely on blacklists that can be easily circumvented
Ignoring the evolution of attack methods
Setting thresholds too high or too low across all contexts
Failing to test filter effectiveness against targeted attacks
Not accounting for multilingual or encoded evasion techniques

Code Examples

To do...

Variations

Common modifications or adaptations of the basic pattern:

Tiered Filtering: Implements multiple filtering layers with increasing computational cost, where inputs pass through simpler filters before more complex ones.

Improves efficiency by quickly rejecting obviously problematic inputs
Reduces the load on more expensive filtering components

Context-Aware Filtering: Adjusts filtering criteria based on conversation history, user profile, or application context.

More permissive in trusted environments or for verified users
Stricter in high-risk contexts or with anonymous users

Federated Filtering: Distributes the filtering process across multiple components or services.

Allows specialized services to focus on specific types of problematic content
Improves scalability and enables easier updates to individual components

Adaptive Filtering: Automatically adjusts filtering thresholds based on observed patterns and feedback.

Tightens restrictions when attacks are detected
Relaxes overly strict filters that generate too many false positives

Multimodal Filtering: Extends filtering beyond text to include images, audio, or other input types.

Addresses risks in multimodal LLM applications
Requires specialized detection systems for each modality

Real-World Examples

Systems or applications where this pattern has been successfully applied:

Content moderation systems for social media platforms that filter user-generated content before processing
Enterprise chatbot deployments that screen inputs for potential security threats or confidential information leakage
Public-facing AI assistants that implement guardrails against prompt injection and jailbreaking attempts
Educational applications that filter student inputs for age-appropriate content before LLM processing
Healthcare applications that screen for potential personal health information (PHI) to maintain HIPAA compliance

Related Patterns

Other patterns that:

Output Filtering: Complements Input Filtering by applying safety checks after LLM generation but before presenting to users
Sandboxing: Often used together with Input Filtering to provide layered protection for AI systems
Constitutions and Principles: Provides internal guardrails that complement external input filtering
Tool Usage Permission Systems: Works alongside Input Filtering to control access to sensitive capabilities
Router Pattern: May incorporate Input Filtering to make routing decisions based on content safety
Confidence-Based Human Escalation: Often triggered by Input Filtering when content falls into uncertain categories