AI Moderation System
The AI Moderation System is the core intelligence of AIMod, providing sophisticated content analysis and automated moderation decisions. This system leverages state-of-the-art language models through LiteLLM to analyze messages, images, and user behavior.
🧠 Core AI Engine
LiteLLM Integration
AIMod uses LiteLLM as its AI abstraction layer, supporting multiple providers:
Supported Providers:
- OpenRouter - Primary provider with access to multiple models
- GitHub Copilot - Enterprise-grade AI with code understanding
- OpenAI - Direct API integration
- Anthropic Claude - Advanced reasoning capabilities
- Google Gemini - Multimodal analysis
Configuration:
```python
import os

def get_litellm_client():
    return LiteLLM(
        api_base="https://openrouter.ai/api/v1",
        api_key=os.getenv("OPENROUTER_API_KEY"),
        model="github_copilot/gpt-4.1",  # Default model
    )
```
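Because LiteLLM dispatches on the model string, switching providers is mostly a configuration change. A minimal sketch of calling the completion API directly, with illustrative model names rather than AIMod defaults:
```python
import litellm

async def moderate_with(model: str, system_prompt: str) -> str:
    """Send a moderation prompt to whichever provider the model prefix selects."""
    # The prefix picks the provider, e.g. "openrouter/...", "anthropic/...",
    # "gemini/..."; API keys are read from the matching environment variables.
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "system", "content": system_prompt}],
    )
    return response.choices[0].message.content
```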
System Prompt Architecture
The AI system uses a sophisticated prompt template that includes:
Context Components:
- Server rules and guidelines
- Channel-specific rules
- User role and permissions
- Recent message history
- Channel category and settings
Prompt Template Structure:
```
You are an AI moderation assistant for a Discord server with a very edgy and dark sense of humor.
Your primary function is to analyze message content and attached media based STRICTLY on the
server rules provided below, using all available context. Your default stance should be to
IGNORE messages unless they are a CLEAR and SEVERE violation.

Server Rules:
---
{rules_text}
---

Context Information:
- User's Server Role: {user_role}
- Channel Category: {channel_category}
- Channel Age-Restricted/NSFW: {nsfw_status}
- Recent Channel History: {recent_messages}

Message to Analyze:
{message_content}

Respond with a JSON object containing your decision...
```
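The prompt asks the model to answer with a JSON object; the fields the pipeline reads in step 7 below are action, rule_violated, reasoning, and confidence. An illustrative payload (values invented for the example):
```python
# Illustrative decision payload; field names follow the keys the bot reads
# (action, rule_violated, reasoning, confidence) -- the values are made up.
example_decision = {
    "action": "WARN",              # IGNORE | WARN | TIMEOUT | BAN | GLOBAL_BAN
    "rule_violated": "Rule 3: No harassment",
    "reasoning": "Message targets another member with repeated insults.",
    "confidence": 85,              # 0-100, see Confidence Scoring below
}
```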
🔍 Message Processing Pipeline
1. Message Reception
```python
@commands.Cog.listener(name="on_message")
async def message_listener(self, message: discord.Message):
    # Initial filtering
    if message.author.bot:
        return
    if not message.content and not message.attachments:
        return
    if not message.guild:
        return
```
2. Configuration Checks
```python
# Check if moderation is enabled
if not await get_guild_config_async(message.guild.id, "ENABLED", True):
    return

# Check channel exclusions
if await is_channel_excluded(message.guild.id, message.channel.id):
    return
```
3. Global Ban Enforcement
```python
# Auto-ban globally banned users
if message.author.id in GLOBAL_BANS:
    ban_reason = "Globally banned for severe universal violation."
    await message.guild.ban(message.author, reason=ban_reason)
    return
```
4. Content Preprocessing
```python
# Truncate long messages
content = truncate_text(message.content, max_length=2000)

# Process attachments
attachment_descriptions = []
for attachment in message.attachments:
    if attachment.content_type and attachment.content_type.startswith('image/'):
        description = await self.media_processor.process_image(attachment.url)
        attachment_descriptions.append(description)
```
5. Context Building
```python
# Get channel-specific or server rules
rules_text = await get_channel_rules(message.guild.id, message.channel.id)

# Build user context
user_role = "Administrator" if message.author.guild_permissions.administrator else "Member"
channel_category = message.channel.category.name if message.channel.category else "Uncategorized"
nsfw_status = getattr(message.channel, 'nsfw', False)

# Get recent message history
recent_messages = await get_recent_channel_history(message.channel, limit=5)
```
6. AI Analysis
```python
# Construct the full prompt
system_prompt = SYSTEM_PROMPT_TEMPLATE.format(
    rules_text=rules_text,
    user_role=user_role,
    channel_category=channel_category,
    nsfw_status=nsfw_status,
    recent_messages=recent_messages,
    message_content=content,
    attachment_descriptions=attachment_descriptions,
)

# Call AI service
response = await self.genai_client.acompletion(
    model=ai_model,
    messages=[{"role": "system", "content": system_prompt}],
    temperature=0.1,
    max_tokens=500,
)
```
7. Decision Processing
```python
# Parse AI response
ai_decision = json.loads(response.choices[0].message.content)

# Extract decision components
action = ai_decision.get("action", "IGNORE")
rule_violated = ai_decision.get("rule_violated", "")
reasoning = ai_decision.get("reasoning", "")
confidence = ai_decision.get("confidence", 0)
```
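Models occasionally wrap the JSON in prose or code fences; a defensive parse, shown here as a sketch rather than the bot's actual error handling, keeps one malformed reply from breaking the listener:
```python
import json

def parse_ai_decision(raw: str) -> dict:
    """Best-effort extraction of the JSON decision object from the model reply."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the first {...} block if the model added surrounding text.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(raw[start : end + 1])
            except json.JSONDecodeError:
                pass
    # Default to a safe no-op decision when parsing fails entirely.
    return {"action": "IGNORE", "reasoning": "Unparseable AI response", "confidence": 0}
```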
8. Action Execution
Based on the AI decision, the system executes the appropriate moderation action (see the dispatch sketch after this list):
- IGNORE: No action taken
- WARN: Delete message + send warning DM
- TIMEOUT: Timeout user + delete message + send notification
- BAN: Ban user + delete message + log action
- GLOBAL_BAN: Global ban + notify all servers
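A dispatch step along these lines (a sketch; the timeout duration and DM wording are assumptions) ties the decision to the corresponding discord.py calls:
```python
import datetime

import discord

async def execute_action(message: discord.Message, decision: dict) -> None:
    """Map an AI decision onto Discord moderation calls (illustrative only)."""
    action = decision.get("action", "IGNORE")
    reason = decision.get("reasoning", "AI moderation action")

    if action == "IGNORE":
        return

    # Every non-IGNORE action removes the offending message first.
    await message.delete()

    if action == "WARN":
        await message.author.send(f"Your message was removed: {reason}")
    elif action == "TIMEOUT":
        # Duration would come from TIMEOUT_DURATIONS (see Advanced Configuration).
        await message.author.timeout(datetime.timedelta(hours=1), reason=reason)
        await message.author.send(f"You have been timed out: {reason}")
    elif action in ("BAN", "GLOBAL_BAN"):
        await message.guild.ban(message.author, reason=reason)
        # GLOBAL_BAN would additionally add the user ID to the global ban list.
```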
🎯 Decision Engine
Action Types
IGNORE
- Default action for acceptable content
- No moderation action taken
- Message remains visible
WARN
- For minor rule violations
- Message is deleted
- User receives warning DM
- Infraction logged in database
TIMEOUT
- For moderate violations
- User is timed out (muted)
- Duration based on violation severity
- Message deleted and user notified
BAN
- For severe violations
- User is banned from the server
- Recent messages deleted
- Permanent record created
GLOBAL_BAN
- For extreme violations
- User banned across all servers
- Added to global ban list
- Immediate enforcement
Confidence Scoring
The AI provides confidence scores (0-100) for its decisions (see the gating sketch after this list):
- 90-100: High confidence, automatic execution
- 70-89: Medium confidence, execute with logging
- 50-69: Low confidence, execute but flag for review
- 0-49: Very low confidence, log but don't execute
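A self-contained sketch of that banding (the label strings are illustrative, not the bot's internal names):
```python
def classify_confidence(confidence: int, threshold: int = 70) -> str:
    """Map a 0-100 confidence score to how the decision should be handled."""
    if confidence >= 90:
        return "execute"            # high confidence: act automatically
    if confidence >= threshold:
        return "execute_and_log"    # medium: act, keep an audit trail
    if confidence >= 50:
        return "execute_and_flag"   # low: act, but surface to moderators
    return "log_only"               # very low: record, take no action
```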
Rule Violation Categories
Content Violations:
- Spam and excessive posting
- NSFW content in inappropriate channels
- Hate speech and discrimination
- Harassment and bullying
- Doxxing and privacy violations
Behavioral Violations:
- Raid participation
- Bot-like behavior
- Evading moderation
- Impersonation
- Malicious links
🖼️ Media Processing
Image Analysis
The system can analyze images using computer vision:
```python
class MediaProcessor:
    async def process_image(self, image_url: str) -> str:
        # Download and process image
        image_data = await self.download_image(image_url)

        # OCR text extraction
        extracted_text = self.extract_text(image_data)

        # Content classification
        content_type = self.classify_content(image_data)

        # Generate description
        description = f"Image contains: {content_type}"
        if extracted_text:
            description += f" Text: {extracted_text}"
        return description
```
Capabilities:
- OCR Text Extraction: Extract text from images
- Content Classification: Identify NSFW or inappropriate content
- Meme Detection: Recognize common meme formats
- QR Code Scanning: Detect and analyze QR codes
Attachment Processing
Supported File Types (see the routing sketch after this list):
- Images: PNG, JPG, GIF, WebP
- Documents: PDF, TXT (text extraction)
- Archives: ZIP, RAR (content listing)
- Audio/Video: Basic metadata extraction
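A routing sketch for MediaProcessor under the assumption of per-type handler methods (process_document and list_archive_contents are hypothetical names):
```python
import discord

async def process_attachment(self, attachment: discord.Attachment) -> str:
    """Dispatch an attachment to a type-specific handler (illustrative sketch)."""
    content_type = attachment.content_type or ""

    if content_type.startswith("image/"):
        return await self.process_image(attachment.url)
    if content_type in ("application/pdf", "text/plain"):
        return await self.process_document(attachment.url)        # hypothetical helper
    if content_type in ("application/zip", "application/x-rar-compressed"):
        return await self.list_archive_contents(attachment.url)   # hypothetical helper
    if content_type.startswith(("audio/", "video/")):
        return f"Media file: {attachment.filename} ({attachment.size} bytes)"
    return f"Unsupported attachment: {attachment.filename}"
```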
⚙️ Configuration Options
Guild-Level Settings
```python
# AI Moderation Settings
ENABLED: bool = True                        # Enable/disable AI moderation
AI_MODEL: str = "github_copilot/gpt-4.1"    # AI model to use
CONFIDENCE_THRESHOLD: int = 70              # Minimum confidence for action
RULES_TEXT: str = "..."                     # Server rules for AI context
```
Channel-Specific Settings
```python
# Channel Exclusions
AI_EXCLUDED_CHANNELS: List[int] = []        # Channels to skip moderation

# Channel-Specific Rules
AI_CHANNEL_RULES: Dict[int, str] = {        # Custom rules per channel
    123456789: "This is a meme channel, be more lenient",
    987654321: "This is a serious discussion channel",
}
```
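When no channel override exists, the lookup would fall back to the server-wide RULES_TEXT. A minimal sketch, assuming the config helpers shown earlier:
```python
async def get_channel_rules(guild_id: int, channel_id: int) -> str:
    """Return channel-specific rules if configured, else the server rules (sketch)."""
    channel_rules = await get_guild_config_async(guild_id, "AI_CHANNEL_RULES", {})
    if channel_id in channel_rules:
        # Channel override found (e.g. a meme channel with looser rules).
        return channel_rules[channel_id]
    # Fall back to the server-wide rules used in the system prompt.
    return await get_guild_config_async(guild_id, "RULES_TEXT", "")
```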
Advanced Configuration
```python
# Timeout Durations (in seconds)
TIMEOUT_DURATIONS = {
    "minor": 300,        # 5 minutes
    "moderate": 3600,    # 1 hour
    "severe": 86400,     # 24 hours
}

# Auto-escalation settings
AUTO_ESCALATE_ENABLED: bool = True
ESCALATION_THRESHOLDS = {
    "warnings": 3,   # Ban after 3 warnings
    "timeouts": 2,   # Ban after 2 timeouts
}
```
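With those settings, an escalation check after each new infraction could look like the following sketch (get_infraction_counts is a hypothetical helper):
```python
import discord

async def maybe_escalate(guild: discord.Guild, member: discord.Member) -> None:
    """Escalate to a ban once warning/timeout thresholds are exceeded (sketch)."""
    if not AUTO_ESCALATE_ENABLED:
        return
    counts = await get_infraction_counts(guild.id, member.id)   # hypothetical helper
    if (counts.get("warnings", 0) >= ESCALATION_THRESHOLDS["warnings"]
            or counts.get("timeouts", 0) >= ESCALATION_THRESHOLDS["timeouts"]):
        await guild.ban(member, reason="Auto-escalation: repeated infractions")
```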
📊 Performance Metrics
Response Times
- Average AI Response: 2-5 seconds
- Message Processing: <1 second (excluding AI)
- Database Operations: <100ms
- Cache Hits: 85-95% for configuration
Accuracy Metrics
- False Positive Rate: <5%
- False Negative Rate: <10%
- User Appeal Success: ~15%
- Moderator Override: <8%
Resource Usage
- Memory per Message: ~50KB
- CPU per Analysis: ~100ms
- Database Queries: 2-4 per message
- API Calls: 1 per analyzed message
🔧 Troubleshooting
Common Issues
AI Service Unavailable (see the fallback sketch after this list):
- Fallback to rule-based moderation
- Queue messages for later processing
- Notify administrators of service issues
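A sketch of that fallback pattern (notify_admins and rule_based_check are placeholder names, not the actual implementation):
```python
import json

async def analyze_with_fallback(self, system_prompt: str, content: str) -> dict:
    """Call the AI service; fall back to rule-based checks when it fails (sketch)."""
    try:
        response = await self.genai_client.acompletion(
            model=self.ai_model,
            messages=[{"role": "system", "content": system_prompt}],
            temperature=0.1,
            max_tokens=500,
        )
        return json.loads(response.choices[0].message.content)
    except Exception as exc:
        # Service unavailable: alert admins and fall back to simple keyword checks;
        # the message could also be queued for re-analysis once the service recovers.
        await self.notify_admins(f"AI service error: {exc}")   # hypothetical helper
        return self.rule_based_check(content)                  # hypothetical fallback
```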
High False Positive Rate:
- Adjust confidence thresholds
- Review and update server rules
- Fine-tune system prompts
Performance Issues:
- Enable caching for frequent queries
- Optimize database indexes
- Consider rate limiting
Monitoring
Key Metrics to Monitor:
- AI response times
- Error rates and types
- Decision confidence distribution
- User appeal rates
- Moderator override frequency
Next: Database System - Comprehensive database architecture and operations