# 📚 Enhanced RAG PDF Chat - Complete Documentation

Table of Contents

  1. [🚀 Getting Started](#-getting-started)
  2. [🏗️ Architecture Overview](#️-architecture-overview)
  3. [📋 API Reference](#-api-reference)
  4. [🔧 Configuration Guide](#-configuration-guide)
  5. [📖 User Guide](#-user-guide)
  6. [🛠️ Developer Guide](#️-developer-guide)
  7. [🚀 Deployment Guide](#-deployment-guide)
  8. [❓ Troubleshooting](#-troubleshooting)
  9. [🤝 Contributing](#-contributing)
  10. [📄 Changelog](#-changelog)

🚀 Getting Started

Prerequisites

  • Python 3.8+ (Recommended: Python 3.9 or 3.10)
  • OpenRouter API Key (Free at [openrouter.ai](https://openrouter.ai))
  • 4GB+ RAM for optimal performance
  • Internet connection for initial model downloads

Quick Installation

1. Clone Repository

git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

2. Environment Setup

# Create virtual environment
python -m venv venv

# Activate environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. Configuration

# Create environment file
echo "OPENROUTER_API_KEY=your_key_here" > .env

4. Initialize System

# Download and setup models (optional - auto-downloads on first use)
python install_enhanced.py

# Launch application
streamlit run app.py

5. Access Application

Open your browser and navigate to: http://localhost:8501

First Run Checklist

  • Python 3.8+ installed
  • Virtual environment activated
  • Dependencies installed successfully
  • API key configured in .env file
  • Application launches without errors
  • Can access web interface

๐Ÿ—๏ธ Architecture Overview

System Architecture

graph TB
    A[📄 Document Input] --> B[🔄 LangChain Processor]
    B --> C[✂️ Advanced Text Splitting]
    C --> D[🧠 Sentence Transformers]
    D --> E[📊 FAISS Vector Index]

    F[💬 User Query] --> G[🎯 Intent Analysis]
    G --> H[🔍 Multi-Strategy Retrieval]
    H --> E
    E --> I[📝 Context Assembly]
    I --> J[🤖 DeepSeek R1 API]
    J --> K[✨ Enhanced Response]

    style A fill:#e1f5fe,color:#000000
    style E fill:#f3e5f5,color:#000000
    style J fill:#e8f5e8,color:#000000
    style K fill:#fff3e0,color:#000000

Core Components

📄 Document Processing Pipeline

| Component | File | Purpose | Technology |
|-----------|------|---------|------------|
| Document Loader | preprocess.py | PDF/Text extraction | LangChain + PyMuPDF |
| Text Splitter | preprocess.py | Intelligent chunking | RecursiveCharacterTextSplitter |
| Embedding Engine | indexer.py | Vector generation | Sentence Transformers |
| Vector Store | indexer.py | Similarity search | FAISS |

💬 Query Processing Pipeline

| Component | File | Purpose | Technology |
|-----------|------|---------|------------|
| Intent Analyzer | rag_pipeline.py | Query classification | Custom NLP |
| Retriever | rag_pipeline.py | Context extraction | Multi-strategy search |
| Prompt Engine | rag_pipeline.py | Dynamic prompting | LangChain Templates |
| LLM Interface | model_loader.py | Response generation | OpenRouter API |

🎨 User Interface

| Component | File | Purpose | Technology |
|-----------|------|---------|------------|
| Main App | app.py | Web interface | Streamlit |
| Upload Handler | upload_handler.py | File processing | PyMuPDF |
| URL Scraper | scraper.py | Web PDF extraction | Requests + PyMuPDF |
| Configuration | config.py | Settings management | Environment Variables |

Data Flow

1. Document Ingestion

PDF/URL → Text Extraction → Chunking → Embedding → Vector Storage

2. Query Processing

User Query → Intent Analysis → Retrieval → Context Assembly → LLM → Response

3. Response Enhancement

Raw Response → Source Attribution → Formatting → UI Display
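
The same flow can be exercised end to end with the module-level functions documented in the API Reference below. A minimal sketch, assuming documents have already been placed in data/pdfs/ and OPENROUTER_API_KEY is configured:

# Hedged end-to-end example using the documented pipeline functions
from preprocess import load_and_split_texts
from indexer import build_faiss_index
from rag_pipeline import run_rag_pipeline

# 1. Document ingestion: extract, chunk, and embed
chunks, metadata = load_and_split_texts(strategy="hybrid")
index, processed_chunks = build_faiss_index(chunks, metadata)

# 2. Query processing: retrieve context and generate a response
response, sources = run_rag_pipeline(
    query="What are the main findings?",
    index=index,
    chunks=processed_chunks,
    metadata=metadata,
)

# 3. Response enhancement: sources carry the attribution shown in the UI
print(response)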

📋 API Reference

Core Classes

EnhancedLangChainRAG

Main RAG pipeline orchestrator.

from rag_pipeline import EnhancedLangChainRAG

rag = EnhancedLangChainRAG()

Methods

initialize_embedding_model()

Initialize sentence transformer model for embeddings.

Returns: SentenceTransformer instance

Example:

model = rag.initialize_embedding_model()

analyze_query_intent(query: str) -> str

Analyze user query to determine optimal response strategy.

Parameters:

  • query (str): User input query

Returns: Intent type ('summary', 'analysis', 'comparison', 'explanation', 'general')

Example:

intent = rag.analyze_query_intent("Summarize the main points")
# Returns: 'summary'

OptimizedLangChainIndexer

Advanced document indexing with FAISS integration.

from indexer import OptimizedLangChainIndexer

indexer = OptimizedLangChainIndexer()

Methods

create_vectorstore_fast(chunks: List[str], metadata: List[Dict]) -> tuple

Create optimized FAISS vector store.

Parameters:

  • chunks (List[str]): Text chunks to index
  • metadata (List[Dict]): Chunk metadata

Returns: (index, chunks) tuple

Example:

index, processed_chunks = indexer.create_vectorstore_fast(
    chunks=["chunk1", "chunk2"],
    metadata=[{"source": "doc1"}, {"source": "doc2"}]
)

LLMManager

OpenRouter API management and response generation.

from model_loader import LLMManager

llm = LLMManager()

Methods

generate_response(prompt: str, max_tokens: int = 1000, temperature: float = 0.7) -> str

Generate AI response using configured model.

Parameters:

  • prompt (str): Input prompt
  • max_tokens (int): Maximum response length
  • temperature (float): Response creativity (0.0-1.0)

Returns: Generated response text

Example:

response = llm.generate_response(
    prompt="Explain quantum computing",
    max_tokens=500,
    temperature=0.7
)
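
Internally, generation goes through the OpenRouter API. A rough sketch of what such a call can look like with plain requests (the actual code in model_loader.py may differ; the endpoint shown is OpenRouter's OpenAI-compatible chat completions route, and the optional HTTP-Referer/X-Title headers use the documented SITE_URL and SITE_NAME settings):

import os
import requests

def openrouter_chat(prompt: str, max_tokens: int = 1000, temperature: float = 0.7) -> str:
    """Illustrative OpenRouter call; not the project's actual implementation."""
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENROUTER_API_KEY')}",
        "HTTP-Referer": os.getenv("SITE_URL", "http://localhost:8501"),
        "X-Title": os.getenv("SITE_NAME", "Enhanced RAG PDF Chat"),
    }
    payload = {
        "model": os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1-0528:free"),
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        json=payload, headers=headers, timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]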

Core Functions

Document Processing

load_and_split_texts(strategy: str = 'hybrid') -> tuple

Load and process documents with advanced splitting.

Parameters:

  • strategy (str): Splitting strategy ('recursive', 'token_based', 'semantic', 'hybrid')

Returns: (chunks, metadata) tuple

Example:

from preprocess import load_and_split_texts

chunks, metadata = load_and_split_texts(strategy='hybrid')

build_faiss_index(chunks: List[str], metadata: List[Dict]) -> tuple

Build optimized FAISS search index.

Example:

from indexer import build_faiss_index

index, processed_chunks = build_faiss_index(chunks, metadata)

Query Processing

run_rag_pipeline(query: str, index, chunks: List[str], metadata: List[Dict], max_tokens: int = 2000) -> tuple

Execute complete RAG pipeline for query processing.

Parameters:

  • query (str): User question
  • index: FAISS index
  • chunks (List[str]): Document chunks
  • metadata (List[Dict]): Chunk metadata
  • max_tokens (int): Response length limit

Returns: (response, chunk_info) tuple

Example:

from rag_pipeline import run_rag_pipeline

response, sources = run_rag_pipeline(
    query="What are the main findings?",
    index=faiss_index,
    chunks=document_chunks,
    metadata=chunk_metadata
)

Utility Functions

get_processing_stats() -> Dict[str, Any]

Get comprehensive document processing statistics.

Example:

from preprocess import get_processing_stats

stats = get_processing_stats()
print(f"Processed {stats['files']} files with {stats['total_words']} words")

get_index_info() -> Dict[str, Any]

Get FAISS index information and status.

Example:

from indexer import get_index_info

info = get_index_info()
print(f"Index contains {info['chunk_count']} chunks")

🔧 Configuration Guide

Environment Variables

Required Configuration

Create a .env file in your project root:

# OpenRouter API Configuration (Required)
OPENROUTER_API_KEY=your_api_key_here
OPENROUTER_MODEL=deepseek/deepseek-r1-0528:free
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1

# Application Settings
SITE_URL=http://localhost:8501
SITE_NAME=Enhanced RAG PDF Chat

Advanced Configuration

# LangChain Text Processing
LANGCHAIN_CHUNK_SIZE=600
LANGCHAIN_CHUNK_OVERLAP=100
LANGCHAIN_MODEL_TEMPERATURE=0.7
LANGCHAIN_MAX_TOKENS=2000
LANGCHAIN_ENABLE_CACHING=true
LANGCHAIN_VERBOSE=false

# Document Processing
CHUNK_SIZE=500
CHUNK_OVERLAP=50
TOP_K=5

# Performance Settings
DEVICE=cpu

Configuration Files

config.py Structure

# Base directories
BASE_DIR = Path(__file__).parent
DATA_DIR = BASE_DIR / "data"

# Data paths
PDF_DIR = str(DATA_DIR / "pdfs")
TEXT_DIR = str(DATA_DIR / "texts")
FAISS_INDEX_PATH = str(DATA_DIR / "faiss_index")
METADATA_PATH = str(DATA_DIR / "metadata.json")

# Model settings
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
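
The tunable values from the Advanced Configuration section above are read from the environment in the same file. A minimal sketch of how that loading might look (the exact parsing in config.py may differ; the defaults shown are the documented ones):

import os
from dotenv import load_dotenv

load_dotenv()  # pull values from .env into the process environment

# Text processing defaults match the documented values
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", 500))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", 50))
TOP_K = int(os.getenv("TOP_K", 5))

# API settings (required)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_MODEL = os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1-0528:free")
OPENROUTER_BASE_URL = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")

# Performance
DEVICE = os.getenv("DEVICE", "cpu")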

Directory Structure

pdf-knowledge-extractor/
├── app.py                 # Main application
├── config.py             # Configuration management
├── data/                 # Data storage
│   ├── pdfs/            # Original PDF files
│   ├── texts/           # Extracted text files
│   ├── faiss_index*     # Vector index files
│   └── metadata.json    # Processing metadata
├── .env                 # Environment variables
└── requirements.txt     # Dependencies

Performance Tuning

Memory Optimization

# For systems with limited RAM
LANGCHAIN_CHUNK_SIZE=400
CHUNK_SIZE=300
TOP_K=3

Processing Speed

# For faster processing
LANGCHAIN_ENABLE_CACHING=true
DEVICE=cpu  # Use 'cuda' if GPU available

Quality Settings

# For higher quality responses
LANGCHAIN_MAX_TOKENS=3000
LANGCHAIN_MODEL_TEMPERATURE=0.5
TOP_K=8

📖 User Guide

Getting Started

1. Document Upload

File Upload Method

  1. Navigate to the "📁 Add Your Documents" section
  2. Click the "📎 Upload Files" tab
  3. Drag and drop PDF files or click "Browse files"
  4. Click "🚀 Process Documents" button
  5. Watch real-time processing progress

URL Method

  1. Click the "🔗 From URL" tab
  2. Enter a direct PDF URL (e.g., https://example.com/document.pdf)
  3. Click "🚀 Process Documents"
  4. System downloads and processes automatically

2. Document Processing

The system performs these steps automatically:

  1. 📥 Processing Documents - Upload/download validation
  2. 📄 Extracting Text Content - PDF text extraction
  3. ✂️ Splitting into Sections - Intelligent chunking
  4. 🧠 Generating Embeddings - Vector creation
  5. 🔍 Building Search Index - FAISS index construction
  6. ✅ Finalizing Setup - Optimization and caching

3. Chatting with Documents

Basic Queries

- "What are the main topics discussed?"
- "Summarize the key findings"
- "List the important conclusions"

Advanced Queries

- "Compare the methodologies used in different sections"
- "Analyze the strengths and weaknesses presented"
- "Explain the relationship between X and Y concepts"

Query Types Supported

| Type | Example | Optimization |
|------|---------|--------------|
| Summary | "Summarize the main points" | Hierarchical organization |
| Analysis | "Analyze the data trends" | Critical evaluation focus |
| Comparison | "Compare approach A vs B" | Systematic contrasting |
| Explanation | "Explain how this works" | Step-by-step breakdown |
| Specific | "What is the definition of X?" | Precise information extraction |

4. Understanding Responses

Response Structure

  • Main Answer: Comprehensive response to your query
  • Source Attribution: References to specific document sections
  • Relevance Scores: Confidence indicators for retrieved content

Source References

Click "๐Ÿ“š Sources" to expand and see:

  • Source document names
  • Relevant text excerpts
  • Confidence scores
  • Page numbers (when available)

5. Managing Conversations

Chat History

  • Last 5 conversations displayed by default
  • Scroll to see complete conversation history
  • Context maintained across related questions

Reset Options

  • ๐Ÿ—‘๏ธ Clear Conversation: Remove chat history only
  • ๐Ÿ”„ Reset All Data: Clear documents and conversations

Best Practices

Document Upload Tips

  1. Optimal File Sizes: 1-50 MB per PDF for best performance
  2. Text-based PDFs: Ensure PDFs contain searchable text (not just images)
  3. Multiple Documents: Upload related documents together for cross-referencing
  4. File Naming: Use descriptive names for better source attribution

Query Optimization

  1. Be Specific: "Analyze financial performance metrics" vs "Tell me about finances"
  2. Use Context: Reference previous questions for follow-up queries
  3. Specify Format: "List the top 5..." or "Provide a brief summary..."
  4. Multiple Aspects: "Compare X and Y in terms of cost, efficiency, and scalability"

Common Use Cases

Academic Research

- "Summarize the literature review section"
- "What methodologies were used in the studies?"
- "Compare the findings across different research papers"

Business Analysis

- "What are the key financial metrics presented?"
- "Analyze the market trends discussed"
- "List the recommendations for improvement"

Technical Documentation

- "Explain the system architecture"
- "What are the installation requirements?"
- "List the troubleshooting steps for issue X"

๐Ÿ› ๏ธ Developer Guide

Project Structure

pdf-knowledge-extractor/
├── app.py                 # Main Streamlit application
├── rag_pipeline.py        # Core RAG implementation
├── indexer.py            # Vector indexing and search
├── preprocess.py         # Document processing
├── model_loader.py       # LLM API management
├── upload_handler.py     # File upload processing
├── scraper.py           # URL PDF extraction
├── config.py            # Configuration management
├── install_enhanced.py   # Setup and model installation
├── requirements.txt      # Python dependencies
├── .env                 # Environment variables
├── .gitignore           # Git exclusions
└── README.md            # Project documentation

Development Setup

1. Development Environment

# Clone repository
git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

# Create development environment
python -m venv dev-env
source dev-env/bin/activate  # or dev-env\Scripts\activate on Windows

# Install dependencies with development tools
pip install -r requirements.txt
pip install pytest black isort flake8 mypy

2. Code Quality Tools

# Format code
black . --line-length 88
isort . --profile black

# Lint code
flake8 . --max-line-length 88 --ignore E203,W503

# Type checking
mypy . --ignore-missing-imports

3. Testing

# Run tests
python -m pytest tests/ -v

# Test coverage
python -m pytest tests/ --cov=. --cov-report=html

Core Architecture

LangChain Integration

Document Processing Chain

# preprocess.py - Document loading and splitting
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyMuPDFLoader(file_path)
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

Vector Store Creation

# indexer.py - FAISS vector store
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
vector_store = FAISS.from_documents(documents, embeddings)

RAG Pipeline

# rag_pipeline.py - Complete RAG implementation
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Create retriever
retriever = vector_store.as_retriever(search_kwargs={"k": TOP_K})

# Custom prompt template
prompt_template = PromptTemplate(
    template="Context: {context}\nQuestion: {question}\nAnswer:",
    input_variables=["context", "question"]
)

# Build QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template}
)

Custom Components

Intent Analysis Engine

def analyze_query_intent(query: str) -> str:
    """Classify query type for optimized processing"""
    intent_patterns = {
        'summary': ['summarize', 'overview', 'main points'],
        'analysis': ['analyze', 'evaluate', 'assessment'],
        'comparison': ['compare', 'contrast', 'versus'],
        'explanation': ['explain', 'how', 'why', 'what is']
    }
    
    query_lower = query.lower()
    for intent, patterns in intent_patterns.items():
        if any(pattern in query_lower for pattern in patterns):
            return intent
    return 'general'
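
A quick usage check of the classifier above:

print(analyze_query_intent("Summarize the main points"))    # 'summary'
print(analyze_query_intent("Compare approach A versus B"))  # 'comparison'
print(analyze_query_intent("List the key dates"))           # 'general' (no pattern matched)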

Multi-Strategy Retrieval

def retrieve_chunks_advanced(query: str, index, strategy: str = "hybrid"):
    """Advanced retrieval with multiple strategies"""
    if strategy == "similarity":
        return similarity_search(query, index)
    elif strategy == "mmr":
        return mmr_search(query, index)
    elif strategy == "hybrid":
        # Combine multiple approaches
        sim_results = similarity_search(query, index)
        mmr_results = mmr_search(query, index)
        return merge_and_rank(sim_results, mmr_results)
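
merge_and_rank is referenced above but not shown. A minimal sketch of one way to combine the two result sets, assuming each strategy returns (chunk, score) pairs where a higher score means more relevant:

def merge_and_rank(sim_results, mmr_results, top_k=5):
    """Merge two (chunk, score) lists, de-duplicate, and keep the best score per chunk."""
    best = {}
    for chunk, score in list(sim_results) + list(mmr_results):
        # Keep the highest score seen for each chunk
        if chunk not in best or score > best[chunk]:
            best[chunk] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]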

Adding New Features

1. Custom Document Loaders

# Create new loader in preprocess.py
from typing import List

from langchain.document_loaders.base import BaseLoader
from langchain.schema import Document

class CustomDocumentLoader(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path
    
    def load(self) -> List[Document]:
        # Implement custom loading logic
        pass

2. New Retrieval Strategies

# Add to rag_pipeline.py
def create_custom_retriever(vector_store, strategy_config: Dict):
    """Create custom retrieval strategy"""
    if strategy_config["type"] == "custom":
        return CustomRetriever(
            vector_store=vector_store,
            **strategy_config["params"]
        )
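
CustomRetriever above is a placeholder. A minimal sketch of what such a class could look like without depending on LangChain internals — a thin wrapper around the vector store's similarity search (the constructor parameters are illustrative):

class CustomRetriever:
    """Illustrative retriever: a thin wrapper around a LangChain FAISS vector store."""

    def __init__(self, vector_store, k: int = 5):
        self.vector_store = vector_store
        self.k = k

    def get_relevant_documents(self, query: str):
        # similarity_search is part of the LangChain vector store interface
        return self.vector_store.similarity_search(query, k=self.k)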

3. Enhanced UI Components

# Add to app.py
def display_custom_component():
    """Custom Streamlit component"""
    with st.container():
        col1, col2 = st.columns(2)
        with col1:
            # Custom visualization
            pass
        with col2:
            # Interactive controls
            pass

Performance Optimization

Caching Strategies

# Streamlit caching
@st.cache_data
def load_and_process_documents():
    """Cache document processing results"""
    return process_documents()

@st.cache_resource
def initialize_models():
    """Cache model initialization"""
    return load_embedding_model()

Async Processing

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def process_documents_async(documents: List[str]):
    """Async document processing"""
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor() as executor:
        tasks = [
            loop.run_in_executor(executor, process_single_doc, doc)
            for doc in documents
        ]
        return await asyncio.gather(*tasks)

Error Handling

Graceful Degradation

import logging

logger = logging.getLogger(__name__)

def robust_processing_pipeline(documents):
    """Processing with fallback mechanisms"""
    try:
        # Try advanced LangChain processing
        return langchain_processing(documents)
    except ImportError:
        # Fallback to basic processing
        return basic_processing(documents)
    except Exception as e:
        # Log error and use minimal processing
        logger.error(f"Processing error: {e}")
        return minimal_processing(documents)

User-Friendly Error Messages

from functools import wraps

def handle_api_errors(func):
    """Decorator for API error handling"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ConnectionError:
            return "🌐 Connection error. Please check your internet connection."
        except AuthenticationError:  # AuthenticationError / RateLimitError come from the API client in use
            return "🔑 API key invalid. Please check your configuration."
        except RateLimitError:
            return "⏰ Rate limit exceeded. Please wait and try again."
    return wrapper
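
Applied to any function that talks to the API; ask_model below is a hypothetical wrapper around LLMManager.generate_response:

@handle_api_errors
def ask_model(prompt: str) -> str:
    return llm.generate_response(prompt, max_tokens=500)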

🚀 Deployment Guide

Local Deployment

Standard Setup

# Production environment
python -m venv prod-env
source prod-env/bin/activate
pip install -r requirements.txt

# Configure production settings
cp .env.example .env
# Edit .env with production values

# Run application
streamlit run app.py --server.port 8501 --server.address 0.0.0.0

Docker Deployment

Dockerfile

FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create data directory
RUN mkdir -p data/pdfs data/texts

# Expose port
EXPOSE 8501

# Health check
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Run application
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Docker Compose

version: '3.8'
services:
  rag-pdf-chat:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
    volumes:
      - ./data:/app/data
    restart: unless-stopped

Docker Commands

# Build image
docker build -t pdf-knowledge-extractor .

# Run container
docker run -p 8501:8501 -e OPENROUTER_API_KEY=your_key pdf-knowledge-extractor

# Using docker-compose
docker-compose up -d

Cloud Deployment

Streamlit Cloud

  1. Connect Repository

    • Link GitHub repository to Streamlit Cloud
    • Configure automatic deployments
  2. Environment Variables

    # Add in Streamlit Cloud dashboard
    OPENROUTER_API_KEY = "your_key_here"
    STREAMLIT_SERVER_MAX_UPLOAD_SIZE = 200
  3. Requirements

    # Ensure requirements.txt includes all dependencies
    streamlit>=1.28.0
    langchain>=0.1.0
    # ... other dependencies

AWS EC2 Deployment

Instance Setup

# Launch EC2 instance (Ubuntu 20.04 LTS)
# Connect via SSH

# Update system
sudo apt update && sudo apt upgrade -y

# Install Python and dependencies
sudo apt install python3-pip python3-venv git -y

# Clone repository
git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

# Setup application
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp .env.example .env
nano .env  # Add your API key

# Install and configure nginx
sudo apt install nginx -y

Nginx Configuration

# /etc/nginx/sites-available/rag-pdf-chat
server {
    listen 80;
    server_name your-domain.com;
    
    location / {
        proxy_pass http://127.0.0.1:8501;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Systemd Service

# /etc/systemd/system/rag-pdf-chat.service
[Unit]
Description=Enhanced RAG PDF Chat
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/pdf-knowledge-extractor
Environment=PATH=/home/ubuntu/pdf-knowledge-extractor/venv/bin
ExecStart=/home/ubuntu/pdf-knowledge-extractor/venv/bin/streamlit run app.py --server.port 8501
Restart=always

[Install]
WantedBy=multi-user.target

Service Management

# Enable and start service
sudo systemctl enable rag-pdf-chat
sudo systemctl start rag-pdf-chat

# Check status
sudo systemctl status rag-pdf-chat

# View logs
sudo journalctl -u rag-pdf-chat -f

Heroku Deployment

Procfile

web: streamlit run app.py --server.port=$PORT --server.address=0.0.0.0

Heroku Setup

# Install Heroku CLI and login
heroku login

# Create app
heroku create your-app-name

# Set environment variables
heroku config:set OPENROUTER_API_KEY=your_key

# Deploy
git push heroku main

Production Considerations

Security

  1. Environment Variables

    # Never commit .env files
    # Use secure environment variable management
    export OPENROUTER_API_KEY="secure_key_here"
  2. HTTPS Configuration

    # Use SSL certificates
    sudo certbot --nginx -d your-domain.com
  3. Access Control

    # Add authentication if needed
    import streamlit_authenticator as stauth

Performance Optimization

  1. Resource Limits

    # config.py - Production settings
    MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50MB
    MAX_CONCURRENT_USERS = 10
    CACHE_TTL = 3600  # 1 hour
  2. Monitoring

    # Add monitoring and logging
    import logging
    
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

Scaling Considerations

  1. Horizontal Scaling

    • Use load balancers for multiple instances
    • Implement session state management
    • Consider Redis for shared caching (see the sketch after this list)
  2. Vertical Scaling

    • Optimize memory usage
    • Use GPU instances for large models
    • Implement efficient caching strategies
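
For the Redis option mentioned above, a minimal sketch of a shared response cache, assuming a Redis instance is reachable at localhost:6379 and the redis Python package is installed (the key scheme and TTL are illustrative):

import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)
CACHE_TTL = 3600  # 1 hour, matching the documented production setting

def cached_answer(query: str, compute_answer) -> str:
    """Return a cached response if any instance has answered this query recently."""
    key = "rag:answer:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = compute_answer(query)
    cache.setex(key, CACHE_TTL, answer)
    return answer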

โ“ Troubleshooting

Common Issues

Installation Problems

Issue: pip install fails

# Solution 1: Upgrade pip
python -m pip install --upgrade pip

# Solution 2: Use specific index
pip install -r requirements.txt -i https://pypi.org/simple/

# Solution 3: Install specific problematic packages
pip install torch --index-url https://download.pytorch.org/whl/cpu

Issue: LangChain import errors

# Check LangChain version
pip show langchain

# Reinstall with specific version
pip uninstall langchain langchain-community
pip install langchain==0.1.0 langchain-community==0.0.10

Issue: FAISS installation fails

# Windows specific
pip install faiss-cpu --no-cache-dir

# macOS specific  
conda install -c conda-forge faiss-cpu

# Linux specific
pip install faiss-cpu==1.7.4

Runtime Errors

Issue: "OpenRouter API key not found"

# Check environment loading
from dotenv import load_dotenv
load_dotenv()
print(os.getenv("OPENROUTER_API_KEY"))

# Verify .env file format
OPENROUTER_API_KEY=your_key_here  # No quotes, no spaces

Issue: "No module named 'sentence_transformers'"

# Install specific version
pip install sentence-transformers==2.2.2

# Clear cache and reinstall
pip cache purge
pip install sentence-transformers --no-cache-dir

Issue: "FAISS index not found"

# Check data directory structure
import os
print(os.listdir("data/"))

# Reset and rebuild index
python -c "
from indexer import get_index_info
print(get_index_info())
"

Performance Issues

Issue: Slow document processing

# Reduce chunk size for faster processing
# In .env file:
LANGCHAIN_CHUNK_SIZE=400
CHUNK_SIZE=300
TOP_K=3

# Or modify config.py temporarily:
CHUNK_SIZE = 300
CHUNK_OVERLAP = 30

Issue: High memory usage

# Monitor memory usage
import psutil
print(f"Memory usage: {psutil.virtual_memory().percent}%")

# Optimize settings in config.py:
DEVICE = "cpu"  # Ensure CPU-only processing
LANGCHAIN_ENABLE_CACHING = False  # Disable caching if needed

Issue: Slow query responses

# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s "https://openrouter.ai/api/v1/models"

# Reduce context size
TOP_K=3  # Fewer retrieved chunks
LANGCHAIN_MAX_TOKENS=1000  # Shorter responses

UI/UX Issues

Issue: Streamlit app won't start

# Check port availability
netstat -an | grep 8501

# Try different port
streamlit run app.py --server.port 8502

# Clear Streamlit cache
streamlit cache clear

Issue: File upload not working

# Check file size limits
# In .streamlit/config.toml:
[server]
maxUploadSize = 200

# Verify file permissions
ls -la data/pdfs/
chmod 755 data/

Issue: Chat history not persisting

# Check session state
import streamlit as st
print(st.session_state)

# Clear and reinitialize
if st.button("Reset Session"):
    for key in st.session_state.keys():
        del st.session_state[key]
    st.rerun()

API and Connectivity Issues

Issue: OpenRouter API errors

# Test API connection
from model_loader import test_openrouter_connection
success, message = test_openrouter_connection()
print(f"API Status: {success}, Message: {message}")

# Check API key validity
import requests
headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}"}
response = requests.get("https://openrouter.ai/api/v1/models", headers=headers)
print(f"Status: {response.status_code}")

Issue: Rate limiting

# Implement retry logic
import time
from functools import wraps

def retry_api_call(max_retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                        time.sleep(delay * (2 ** attempt))  # Exponential backoff
                        continue
                    raise e
            return None
        return wrapper
    return decorator

Debugging Tools

Enable Debug Mode

# Add to config.py
DEBUG = True
LANGCHAIN_VERBOSE = True
LANGCHAIN_DEBUG = True

# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)

Memory Profiling

# Install memory profiler
pip install memory-profiler

# Profile specific functions
from memory_profiler import profile

@profile
def process_large_document():
    # Your processing code
    pass

Performance Monitoring

# Add timing decorators
import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper

@timer
def slow_function():
    # Function to profile
    pass

Network Diagnostics

# Test connectivity
ping openrouter.ai
nslookup openrouter.ai

# Check DNS resolution
dig openrouter.ai

# Test HTTPS connectivity
curl -I https://openrouter.ai/api/v1/models

Error Recovery

Automatic Recovery Mechanisms

# Implement graceful fallbacks
def robust_rag_pipeline(query, fallback_enabled=True):
    try:
        return advanced_rag_pipeline(query)
    except Exception as e:
        if fallback_enabled:
            print(f"Advanced pipeline failed: {e}")
            return simple_rag_pipeline(query)
        raise e

def simple_rag_pipeline(query):
    """Simplified RAG without advanced features"""
    # Basic similarity search without LangChain
    pass

Data Recovery

# Backup and restore functions
import os
import shutil
from datetime import datetime

def backup_index():
    """Backup FAISS index and metadata"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = f"backups/backup_{timestamp}"
    os.makedirs(backup_dir, exist_ok=True)
    
    shutil.copytree("data/", f"{backup_dir}/data/")
    print(f"Backup created: {backup_dir}")

def restore_index(backup_path):
    """Restore from backup"""
    if os.path.exists(backup_path):
        shutil.copytree(f"{backup_path}/data/", "data/", dirs_exist_ok=True)
        print("Restore completed")

๐Ÿค Contributing

Getting Started

Development Environment Setup

# Fork the repository on GitHub
# Clone your fork
git clone https://github.com/<your-username>/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

# Add upstream remote
git remote add upstream https://github.com/saktisriraj/pdf-knowledge-extractor.git

# Create development environment
python -m venv dev-env
source dev-env/bin/activate  # Windows: dev-env\Scripts\activate

# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # Development tools

Development Dependencies

Create requirements-dev.txt:

# Testing
pytest>=7.0.0
pytest-cov>=4.0.0
pytest-mock>=3.10.0

# Code Quality
black>=23.0.0
isort>=5.12.0
flake8>=6.0.0
mypy>=1.0.0

# Documentation
sphinx>=6.0.0
sphinx-rtd-theme>=1.2.0

# Development Tools
pre-commit>=3.0.0
jupyter>=1.0.0
ipython>=8.0.0

Code Standards

Python Code Style

# Use Black formatter with 88 character line length
black . --line-length 88

# Import sorting with isort
isort . --profile black

# Type hints for all functions
def process_document(file_path: str, chunk_size: int = 500) -> List[str]:
    """Process document and return chunks.
    
    Args:
        file_path: Path to the document file
        chunk_size: Size of text chunks
        
    Returns:
        List of text chunks
        
    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If chunk_size is invalid
    """
    pass

Documentation Standards

class DocumentProcessor:
    """Advanced document processing with LangChain integration.
    
    This class handles document loading, text extraction, and chunking
    using various LangChain strategies for optimal performance.
    
    Attributes:
        strategy: Text splitting strategy ('recursive', 'semantic', etc.)
        chunk_size: Maximum size of text chunks
        
    Example:
        >>> processor = DocumentProcessor(strategy='hybrid')
        >>> chunks = processor.process_pdf('document.pdf')
        >>> print(f"Created {len(chunks)} chunks")
    """
    
    def __init__(self, strategy: str = 'hybrid', chunk_size: int = 500):
        """Initialize processor with specified strategy."""
        pass

Commit Message Format

# Format: <type>(<scope>): <description>
# Types: feat, fix, docs, style, refactor, test, chore

# Examples:
git commit -m "feat(rag): add multi-query retrieval strategy"
git commit -m "fix(ui): resolve file upload progress bar issue"
git commit -m "docs(api): update RAG pipeline documentation"
git commit -m "refactor(indexer): optimize FAISS index creation"

Contribution Types

๐Ÿ› Bug Reports

Use this template for bug reports:

## Bug Report

**Describe the bug**
A clear description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
What you expected to happen.

**Screenshots**
If applicable, add screenshots.

**Environment:**
- OS: [e.g. Windows 10, macOS 12.0, Ubuntu 20.04]
- Python Version: [e.g. 3.9.7]
- Dependencies: [paste requirements.txt versions]

**Additional context**
Any other context about the problem.

✨ Feature Requests

## Feature Request

**Is your feature request related to a problem?**
A clear description of what the problem is.

**Describe the solution you'd like**
A clear description of what you want to happen.

**Describe alternatives you've considered**
Alternative solutions or features you've considered.

**Additional context**
Add any other context, mockups, or examples.

**Implementation ideas**
If you have ideas about how to implement this feature.

🚀 Pull Requests

PR Template

## Pull Request

**Description**
Brief description of changes made.

**Type of change**
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update

**Testing**
- [ ] Tests pass locally
- [ ] Added tests for new functionality
- [ ] Manual testing completed

**Checklist**
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Code is commented appropriately
- [ ] Documentation updated
- [ ] No new warnings introduced

PR Guidelines

  1. Branch Naming

    # Feature branches
    git checkout -b feature/add-multi-modal-support
    
    # Bug fix branches  
    git checkout -b fix/resolve-memory-leak
    
    # Documentation branches
    git checkout -b docs/update-api-reference
  2. Development Workflow

    # Update main branch
    git checkout main
    git pull upstream main
    
    # Create feature branch
    git checkout -b feature/your-feature-name
    
    # Make changes and commit
    git add .
    git commit -m "feat(component): add new functionality"
    
    # Push to your fork
    git push origin feature/your-feature-name
    
    # Create pull request on GitHub
  3. Code Review Process

    • All PRs require at least one review
    • Address reviewer comments promptly
    • Update documentation for new features
    • Ensure tests pass before requesting review

Testing Guidelines

Unit Tests

# tests/test_rag_pipeline.py
import pytest
from unittest.mock import Mock, patch
from rag_pipeline import EnhancedLangChainRAG

class TestEnhancedLangChainRAG:
    def setup_method(self):
        """Setup test fixtures."""
        self.rag = EnhancedLangChainRAG()
    
    def test_analyze_query_intent_summary(self):
        """Test query intent analysis for summary queries."""
        query = "Summarize the main points of this document"
        intent = self.rag.analyze_query_intent(query)
        assert intent == "summary"
    
    @patch('rag_pipeline.generate_response')
    def test_run_rag_pipeline_success(self, mock_generate):
        """Test successful RAG pipeline execution."""
        mock_generate.return_value = "Test response"
        
        # Mock dependencies
        mock_index = Mock()
        chunks = ["chunk1", "chunk2"]
        metadata = [{"source": "doc1"}, {"source": "doc2"}]
        
        response, chunk_info = self.rag.run_rag_pipeline(
            "test query", mock_index, chunks, metadata
        )
        
        assert response == "Test response"
        assert len(chunk_info) > 0

Integration Tests

# tests/test_integration.py
import tempfile
import os
from app import main
from indexer import build_faiss_index
from preprocess import load_and_split_texts

class TestIntegration:
    def test_end_to_end_pipeline(self):
        """Test complete document processing pipeline."""
        with tempfile.TemporaryDirectory() as temp_dir:
            # Create test PDF
            test_pdf_path = os.path.join(temp_dir, "test.pdf")
            self.create_test_pdf(test_pdf_path)
            
            # Process document
            chunks, metadata = load_and_split_texts()
            assert len(chunks) > 0
            
            # Build index
            index, processed_chunks = build_faiss_index(chunks, metadata)
            assert index is not None
            
            # Test query
            from rag_pipeline import run_rag_pipeline
            response, sources = run_rag_pipeline(
                "What is this document about?", 
                index, chunks, metadata
            )
            assert len(response) > 0

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_rag_pipeline.py -v

# Run with coverage
python -m pytest tests/ --cov=. --cov-report=html

# Run integration tests only
python -m pytest tests/test_integration.py -v

Documentation Contributions

API Documentation

# Use Google-style docstrings
def retrieve_chunks_advanced(
    query: str, 
    index: Any, 
    chunks: List[str], 
    metadata: Optional[List[Dict]] = None,
    top_k: int = 5
) -> Tuple[List[str], List[Dict]]:
    """Retrieve document chunks using advanced strategies.
    
    Performs similarity search with multiple retrieval strategies
    including semantic search, MMR, and hybrid approaches.
    
    Args:
        query: User query string for semantic search
        index: FAISS index for similarity computation
        chunks: List of document text chunks
        metadata: Optional metadata for each chunk
        top_k: Number of top chunks to retrieve
        
    Returns:
        Tuple containing:
            - List of retrieved text chunks
            - List of chunk metadata with relevance scores
            
    Raises:
        ValueError: If query is empty or index is invalid
        IndexError: If top_k exceeds available chunks
        
    Example:
        >>> chunks, info = retrieve_chunks_advanced(
        ...     "What is machine learning?",
        ...     faiss_index,
        ...     document_chunks,
        ...     chunk_metadata,
        ...     top_k=5
        ... )
        >>> print(f"Retrieved {len(chunks)} relevant chunks")
    """

README Updates

When adding features, update the README:

  1. Feature list - Add new capabilities
  2. Installation - Update if new dependencies
  3. Usage examples - Show new functionality
  4. Configuration - Document new settings

Wiki Documentation

Maintain these wiki pages:

  • API Reference - Complete function documentation
  • Architecture Guide - System design and components
  • Deployment Guide - Production deployment instructions
  • Troubleshooting - Common issues and solutions
  • Contributing Guide - This document
  • Changelog - Version history and breaking changes

📄 Changelog

Version 1.0.0 (2024-01-15)

🚀 Initial Release

Core Features

  • Enhanced RAG Pipeline: Complete LangChain integration with advanced retrieval strategies
  • Multi-Strategy Text Processing: Recursive, token-based, semantic, and hybrid chunking
  • FAISS Vector Search: Optimized similarity search with contextual compression
  • Intent-Based Query Analysis: Automatic query classification for optimized responses
  • Real-Time UI: Animated Streamlit interface with progress tracking

Architecture Components

  • Document Processing: Advanced PDF loading with PyMuPDF and LangChain loaders
  • Vector Indexing: High-performance FAISS indexing with sentence-transformers
  • API Integration: OpenRouter API management with DeepSeek R1 model
  • Configuration Management: Comprehensive environment variable system
  • Error Handling: Graceful fallbacks and comprehensive error recovery

User Interface

  • Gradient Styling: Modern CSS with animated elements
  • Progress Tracking: Real-time processing feedback with step indicators
  • Source Attribution: Expandable source references with relevance scoring
  • Conversation Memory: Context-aware chat with history management

Technical Highlights

  • Performance Optimization: Parallel processing and intelligent caching
  • Cross-Platform: Windows, macOS, and Linux compatibility
  • Modular Design: Clean separation of concerns for easy extension
  • Production Ready: Comprehensive error handling and monitoring

🔧 Configuration

  • Environment variable management with .env support
  • Configurable chunk sizes and overlap parameters
  • Adjustable retrieval strategies and model parameters
  • Customizable UI themes and animations

📚 Documentation

  • Complete API reference with examples
  • Architecture diagrams and component documentation
  • Deployment guides for local and cloud environments
  • Comprehensive troubleshooting and FAQ sections

🧪 Testing

  • Unit tests for core components
  • Integration tests for end-to-end workflows
  • Performance benchmarks and optimization guidelines
  • Error handling validation

Version History Template

Version X.Y.Z (YYYY-MM-DD)

✨ New Features

  • Feature description with technical details
  • User-facing improvements and capabilities

๐Ÿ› Bug Fixes

  • Issue resolution with root cause analysis
  • Performance improvements and optimizations

🔧 Technical Changes

  • Architecture updates and refactoring
  • Dependency updates and compatibility improvements

📖 Documentation

  • New documentation sections
  • Updated examples and tutorials

โš ๏ธ Breaking Changes

  • API changes that require user action
  • Configuration changes and migration guides

๐Ÿ—‘๏ธ Deprecated

  • Features marked for removal in future versions
  • Migration paths for deprecated functionality

Future Roadmap

Version 1.1.0 (Planned)

  • Multi-Modal Support: Image and table extraction from PDFs
  • Advanced Analytics: Document insights and visualization
  • Batch Processing: Multiple document upload optimization
  • Export Features: Conversation and insight export functionality

Version 1.2.0 (Planned)

  • API Endpoints: RESTful API for programmatic access
  • Authentication: User management and access control
  • Cloud Storage: Integration with AWS S3, Google Drive
  • Advanced Search: Full-text search and filtering capabilities

Version 2.0.0 (Future)

  • Multi-Language Support: International language processing
  • Custom Models: Support for local LLM deployment
  • Enterprise Features: Advanced security and compliance
  • Mobile App: Native mobile application development

This comprehensive documentation provides everything needed to understand, use, develop, and deploy the Enhanced RAG PDF Chat application. Each section is designed to be self-contained while linking to related concepts throughout the documentation.

For the latest updates and additional resources, visit the [GitHub repository](https://github.com/saktisriraj/pdf-knowledge-extractor) and [project wiki](https://github.com/saktisriraj/pdf-knowledge-extractor/wiki).

โš ๏ธ **GitHub.com Fallback** โš ๏ธ