Input Filtering - joehubert/ai-agent-design-patterns GitHub Wiki

Home::Overview of Patterns

Classification

Safety Pattern

Intent

To screen and sanitize user inputs for potentially problematic content before they reach the language model, reducing the risk of harmful, manipulative, or malicious prompts that could lead to undesired outputs or behaviors.

Also Known As

Input Screening, Prompt Filtering, Content Moderation, Input Sanitization, Request Validation

Motivation

Language models can be vulnerable to various forms of prompt manipulation, including:

  • Prompt injections that attempt to override system instructions
  • Jailbreaking attempts that try to circumvent safety guardrails
  • Harmful content that may trigger unsafe model responses
  • Inputs designed to extract confidential information or cause resource exhaustion

Traditional approaches that rely solely on the LLM itself to identify and reject problematic content are insufficient because:

  1. The model may not recognize subtle manipulation tactics
  2. By the time the model processes the input, it has already been exposed to potentially harmful content
  3. Malicious inputs can be crafted to bypass the model's built-in safeguards

Input Filtering provides a protective layer that analyzes and screens inputs before they reach the primary LLM, using specialized detection models, pattern matching, and other techniques specifically designed to identify problematic patterns.

Applicability

When to use this pattern:

  • In systems with public-facing LLM interfaces accessible to unauthenticated users
  • Applications handling sensitive data or providing critical services
  • When deployment includes high-stakes use cases (healthcare, financial, legal)
  • Systems that need to comply with specific content policies or regulations
  • Applications targeting vulnerable populations, including children
  • When LLM outputs could potentially trigger automated actions or workflows
  • For services where rapid response time is essential, and post-processing of problematic outputs is insufficient

Prerequisites for successful implementation:

  • Clear definition of what constitutes problematic content for your specific application
  • Regular updates to filtering rules as new evasion techniques emerge
  • Monitoring systems to track filter effectiveness and false positive rates

Structure

To do...

Components

The key elements participating in the pattern:

  • Input Receiver: The component that initially accepts user inputs and passes them to the filtering system before LLM processing.

  • Content Classifier: Specialized models trained to identify specific categories of problematic content (toxicity, hate speech, personal information, etc.).

  • Pattern Matcher: Rule-based systems that detect known patterns of prompt attacks, using regular expressions or similar pattern matching approaches.

  • Token/Embedding Analyzer: Components that examine token-level or embedding-level representations of inputs to identify anomalous patterns.

  • Policy Manager: Maintains the current filtering rules and thresholds, potentially customized for different user roles or contexts.

  • Decision Engine: Combines signals from various filtering components to make a final determination on whether to block, flag, or allow the input.

  • Feedback Collector: Gathers information about filter performance, including false positives and false negatives.

Interactions

How the components work together:

  1. The Input Receiver captures the incoming user request and passes it to the Decision Engine.

  2. The Decision Engine coordinates the filtering process, calling specialized components:

    • Content Classifier analyzes the input for prohibited content categories
    • Pattern Matcher checks for known attack patterns
    • Token/Embedding Analyzer looks for statistical anomalies
  3. Each component returns its assessment to the Decision Engine.

  4. The Decision Engine consults the Policy Manager to apply the appropriate policy rules and thresholds.

  5. Based on the combined analysis, the Decision Engine makes one of several determinations:

    • Allow the input to proceed to the LLM without modification
    • Block the input entirely with an appropriate message to the user
    • Modify/sanitize the input before passing it to the LLM
    • Flag the input for human review before processing
  6. The Feedback Collector logs the decision and outcome for further system improvement.

Consequences

The results and trade-offs of using the pattern:

Benefits:

  • Reduces the risk of prompt manipulation and jailbreaking attempts
  • Protects the system from malicious inputs before they reach the main model
  • Allows for specialized detection of different types of problematic content
  • Can be updated and modified independently of the main LLM
  • Reduces liability and risk for operators of LLM-based services
  • Can be customized based on application context and user permissions

Limitations:

  • Adds latency to request processing
  • May produce false positives that block legitimate requests
  • Requires ongoing maintenance to keep up with new attack vectors
  • Cannot eliminate all risks, especially for novel attack patterns
  • May struggle with contextual nuance that would be obvious to humans

Performance implications:

  • Lightweight filtering (e.g., pattern matching) adds minimal overhead
  • More sophisticated filtering (e.g., dedicated classifier models) increases compute requirements
  • Batching filter operations can improve throughput at the cost of individual request latency
  • Caching common input patterns can improve performance for repeated or similar requests

Implementation

Guidelines for implementing the pattern:

  1. Define clear filtering criteria: Establish what types of content should be blocked, flagged, or allowed based on your application's requirements and risk assessment.

  2. Start with layered defense:

    • Begin with simple pattern matching for known attack vectors
    • Add specialized classifiers for categories relevant to your use case
    • Implement statistical anomaly detection for novel attacks
  3. Balance sensitivity with usability:

    • Too strict filtering creates frustrating false positives
    • Too permissive filtering leaves vulnerabilities
    • Consider user-specific thresholds based on trust levels or use cases
  4. Implement graceful rejection:

    • Provide helpful feedback when content is filtered
    • Avoid revealing specific filter criteria that could help bypass the system
    • Consider offering alternatives or suggestions when possible
  5. Monitor and improve:

    • Log filtering decisions and outcomes
    • Regularly review false positives and false negatives
    • Update filter rules based on emerging threats and user feedback
  6. Consider human-in-the-loop for edge cases:

    • Route borderline cases to human moderators
    • Use human decisions to improve automated filtering

Common pitfalls to avoid:

  • Relying solely on blacklists that can be easily circumvented
  • Ignoring the evolution of attack methods
  • Setting thresholds too high or too low across all contexts
  • Failing to test filter effectiveness against targeted attacks
  • Not accounting for multilingual or encoded evasion techniques

Code Examples

To do...

Variations

Common modifications or adaptations of the basic pattern:

Tiered Filtering: Implements multiple filtering layers with increasing computational cost, where inputs pass through simpler filters before more complex ones.

  • Improves efficiency by quickly rejecting obviously problematic inputs
  • Reduces the load on more expensive filtering components

Context-Aware Filtering: Adjusts filtering criteria based on conversation history, user profile, or application context.

  • More permissive in trusted environments or for verified users
  • Stricter in high-risk contexts or with anonymous users

Federated Filtering: Distributes the filtering process across multiple components or services.

  • Allows specialized services to focus on specific types of problematic content
  • Improves scalability and enables easier updates to individual components

Adaptive Filtering: Automatically adjusts filtering thresholds based on observed patterns and feedback.

  • Tightens restrictions when attacks are detected
  • Relaxes overly strict filters that generate too many false positives

Multimodal Filtering: Extends filtering beyond text to include images, audio, or other input types.

  • Addresses risks in multimodal LLM applications
  • Requires specialized detection systems for each modality

Real-World Examples

Systems or applications where this pattern has been successfully applied:

  • Content moderation systems for social media platforms that filter user-generated content before processing
  • Enterprise chatbot deployments that screen inputs for potential security threats or confidential information leakage
  • Public-facing AI assistants that implement guardrails against prompt injection and jailbreaking attempts
  • Educational applications that filter student inputs for age-appropriate content before LLM processing
  • Healthcare applications that screen for potential personal health information (PHI) to maintain HIPAA compliance

Related Patterns

Other patterns that: