# 📚 Enhanced RAG PDF Chat - Complete Documentation

Table of Contents

  1. [🚀 Getting Started](#-getting-started)
  2. [🏗️ Architecture Overview](#️-architecture-overview)
  3. [📋 API Reference](#-api-reference)
  4. [🔧 Configuration Guide](#-configuration-guide)
  5. [📖 User Guide](#-user-guide)
  6. [🛠️ Developer Guide](#️-developer-guide)
  7. [🚀 Deployment Guide](#-deployment-guide)
  8. [❓ Troubleshooting](#-troubleshooting)
  9. [🤝 Contributing](#-contributing)
  10. [📄 Changelog](#-changelog)

🚀 Getting Started

Prerequisites

  • Python 3.8+ (Recommended: Python 3.9 or 3.10)
  • OpenRouter API Key (Free at [openrouter.ai](https://openrouter.ai))
  • 4GB+ RAM for optimal performance
  • Internet connection for initial model downloads

Quick Installation

1. Clone Repository

git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

2. Environment Setup

# Create virtual environment
python -m venv venv

# Activate environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. Configuration

# Create environment file
echo "OPENROUTER_API_KEY=your_key_here" > .env

4. Initialize System

# Download and setup models (optional - auto-downloads on first use)
python install_enhanced.py

# Launch application
streamlit run app.py

5. Access Application

Open your browser and navigate to: http://localhost:8501

First Run Checklist

  • Python 3.8+ installed
  • Virtual environment activated
  • Dependencies installed successfully
  • API key configured in .env file
  • Application launches without errors
  • Can access web interface

๐Ÿ—๏ธ Architecture Overview

System Architecture

graph TB
    A[📄 Document Input] --> B[🔄 LangChain Processor]
    B --> C[✂️ Advanced Text Splitting]
    C --> D[🧠 Sentence Transformers]
    D --> E[📊 FAISS Vector Index]

    F[💬 User Query] --> G[🎯 Intent Analysis]
    G --> H[🔍 Multi-Strategy Retrieval]
    H --> E
    E --> I[📝 Context Assembly]
    I --> J[🤖 DeepSeek R1 API]
    J --> K[✨ Enhanced Response]

    style A fill:#e1f5fe,color:#000000
    style E fill:#f3e5f5,color:#000000
    style J fill:#e8f5e8,color:#000000
    style K fill:#fff3e0,color:#000000

Core Components

📄 Document Processing Pipeline

| Component | File | Purpose | Technology |
|-----------|------|---------|------------|
| Document Loader | preprocess.py | PDF/Text extraction | LangChain + PyMuPDF |
| Text Splitter | preprocess.py | Intelligent chunking | RecursiveCharacterTextSplitter |
| Embedding Engine | indexer.py | Vector generation | Sentence Transformers |
| Vector Store | indexer.py | Similarity search | FAISS |

💬 Query Processing Pipeline

| Component | File | Purpose | Technology |
|-----------|------|---------|------------|
| Intent Analyzer | rag_pipeline.py | Query classification | Custom NLP |
| Retriever | rag_pipeline.py | Context extraction | Multi-strategy search |
| Prompt Engine | rag_pipeline.py | Dynamic prompting | LangChain Templates |
| LLM Interface | model_loader.py | Response generation | OpenRouter API |

🎨 User Interface

| Component | File | Purpose | Technology |
|-----------|------|---------|------------|
| Main App | app.py | Web interface | Streamlit |
| Upload Handler | upload_handler.py | File processing | PyMuPDF |
| URL Scraper | scraper.py | Web PDF extraction | Requests + PyMuPDF |
| Configuration | config.py | Settings management | Environment Variables |

Data Flow

1. Document Ingestion

PDF/URL → Text Extraction → Chunking → Embedding → Vector Storage

2. Query Processing

User Query → Intent Analysis → Retrieval → Context Assembly → LLM → Response

3. Response Enhancement

Raw Response → Source Attribution → Formatting → UI Display
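
The same flow can be exercised end to end with the module-level functions documented in the API Reference below. A minimal sketch, assuming documents have already been placed in data/pdfs/ and OPENROUTER_API_KEY is configured:

# Hedged end-to-end example using the documented pipeline functions
from preprocess import load_and_split_texts
from indexer import build_faiss_index
from rag_pipeline import run_rag_pipeline

# 1. Document ingestion: extract, chunk, and embed
chunks, metadata = load_and_split_texts(strategy="hybrid")
index, processed_chunks = build_faiss_index(chunks, metadata)

# 2. Query processing: retrieve context and generate a response
response, sources = run_rag_pipeline(
    query="What are the main findings?",
    index=index,
    chunks=processed_chunks,
    metadata=metadata,
)

# 3. Response enhancement: sources carry the attribution shown in the UI
print(response)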

📋 API Reference

Core Classes

EnhancedLangChainRAG

Main RAG pipeline orchestrator.

from rag_pipeline import EnhancedLangChainRAG

rag = EnhancedLangChainRAG()

Methods

initialize_embedding_model()

Initialize sentence transformer model for embeddings.

Returns: SentenceTransformer instance

Example:

model = rag.initialize_embedding_model()

analyze_query_intent(query: str) -> str

Analyze user query to determine optimal response strategy.

Parameters:

  • query (str): User input query

Returns: Intent type ('summary', 'analysis', 'comparison', 'explanation', 'general')

Example:

intent = rag.analyze_query_intent("Summarize the main points")
# Returns: 'summary'

OptimizedLangChainIndexer

Advanced document indexing with FAISS integration.

from indexer import OptimizedLangChainIndexer

indexer = OptimizedLangChainIndexer()

Methods

create_vectorstore_fast(chunks: List[str], metadata: List[Dict]) -> tuple

Create optimized FAISS vector store.

Parameters:

  • chunks (List[str]): Text chunks to index
  • metadata (List[Dict]): Chunk metadata

Returns: (index, chunks) tuple

Example:

index, processed_chunks = indexer.create_vectorstore_fast(
    chunks=["chunk1", "chunk2"],
    metadata=[{"source": "doc1"}, {"source": "doc2"}]
)

LLMManager

OpenRouter API management and response generation.

from model_loader import LLMManager

llm = LLMManager()

Methods

generate_response(prompt: str, max_tokens: int = 1000, temperature: float = 0.7) -> str

Generate AI response using configured model.

Parameters:

  • prompt (str): Input prompt
  • max_tokens (int): Maximum response length
  • temperature (float): Response creativity (0.0-1.0)

Returns: Generated response text

Example:

response = llm.generate_response(
    prompt="Explain quantum computing",
    max_tokens=500,
    temperature=0.7
)
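
Internally, generation goes through the OpenRouter API. A rough sketch of what such a call can look like with plain requests (the actual code in model_loader.py may differ; the endpoint shown is OpenRouter's OpenAI-compatible chat completions route, and the optional HTTP-Referer/X-Title headers use the documented SITE_URL and SITE_NAME settings):

import os
import requests

def openrouter_chat(prompt: str, max_tokens: int = 1000, temperature: float = 0.7) -> str:
    """Illustrative OpenRouter call; not the project's actual implementation."""
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENROUTER_API_KEY')}",
        "HTTP-Referer": os.getenv("SITE_URL", "http://localhost:8501"),
        "X-Title": os.getenv("SITE_NAME", "Enhanced RAG PDF Chat"),
    }
    payload = {
        "model": os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1-0528:free"),
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        json=payload, headers=headers, timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]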

Core Functions

Document Processing

load_and_split_texts(strategy: str = 'hybrid') -> tuple

Load and process documents with advanced splitting.

Parameters:

  • strategy (str): Splitting strategy ('recursive', 'token_based', 'semantic', 'hybrid')

Returns: (chunks, metadata) tuple

Example:

from preprocess import load_and_split_texts

chunks, metadata = load_and_split_texts(strategy='hybrid')

build_faiss_index(chunks: List[str], metadata: List[Dict]) -> tuple

Build optimized FAISS search index.

Example:

from indexer import build_faiss_index

index, processed_chunks = build_faiss_index(chunks, metadata)

Query Processing

run_rag_pipeline(query: str, index, chunks: List[str], metadata: List[Dict], max_tokens: int = 2000) -> tuple

Execute complete RAG pipeline for query processing.

Parameters:

  • query (str): User question
  • index: FAISS index
  • chunks (List[str]): Document chunks
  • metadata (List[Dict]): Chunk metadata
  • max_tokens (int): Response length limit

Returns: (response, chunk_info) tuple

Example:

from rag_pipeline import run_rag_pipeline

response, sources = run_rag_pipeline(
    query="What are the main findings?",
    index=faiss_index,
    chunks=document_chunks,
    metadata=chunk_metadata
)

Utility Functions

get_processing_stats() -> Dict[str, Any]

Get comprehensive document processing statistics.

Example:

from preprocess import get_processing_stats

stats = get_processing_stats()
print(f"Processed {stats['files']} files with {stats['total_words']} words")

get_index_info() -> Dict[str, Any]

Get FAISS index information and status.

Example:

from indexer import get_index_info

info = get_index_info()
print(f"Index contains {info['chunk_count']} chunks")

🔧 Configuration Guide

Environment Variables

Required Configuration

Create a .env file in your project root:

# OpenRouter API Configuration (Required)
OPENROUTER_API_KEY=your_api_key_here
OPENROUTER_MODEL=deepseek/deepseek-r1-0528:free
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1

# Application Settings
SITE_URL=http://localhost:8501
SITE_NAME=Enhanced RAG PDF Chat

Advanced Configuration

# LangChain Text Processing
LANGCHAIN_CHUNK_SIZE=600
LANGCHAIN_CHUNK_OVERLAP=100
LANGCHAIN_MODEL_TEMPERATURE=0.7
LANGCHAIN_MAX_TOKENS=2000
LANGCHAIN_ENABLE_CACHING=true
LANGCHAIN_VERBOSE=false

# Document Processing
CHUNK_SIZE=500
CHUNK_OVERLAP=50
TOP_K=5

# Performance Settings
DEVICE=cpu

Configuration Files

config.py Structure

# Base directories
BASE_DIR = Path(__file__).parent
DATA_DIR = BASE_DIR / "data"

# Data paths
PDF_DIR = str(DATA_DIR / "pdfs")
TEXT_DIR = str(DATA_DIR / "texts")
FAISS_INDEX_PATH = str(DATA_DIR / "faiss_index")
METADATA_PATH = str(DATA_DIR / "metadata.json")

# Model settings
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
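
The tunable values from the Advanced Configuration section above are read from the environment in the same file. A minimal sketch of how that loading might look (the exact parsing in config.py may differ; the defaults shown are the documented ones):

import os
from dotenv import load_dotenv

load_dotenv()  # pull values from .env into the process environment

# Text processing defaults match the documented values
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", 500))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", 50))
TOP_K = int(os.getenv("TOP_K", 5))

# API settings (required)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_MODEL = os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1-0528:free")
OPENROUTER_BASE_URL = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")

# Performance
DEVICE = os.getenv("DEVICE", "cpu")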

Directory Structure

pdf-knowledge-extractor/
├── app.py                 # Main application
├── config.py             # Configuration management
├── data/                 # Data storage
│   ├── pdfs/            # Original PDF files
│   ├── texts/           # Extracted text files
│   ├── faiss_index*     # Vector index files
│   └── metadata.json    # Processing metadata
├── .env                 # Environment variables
└── requirements.txt     # Dependencies

Performance Tuning

Memory Optimization

# For systems with limited RAM
LANGCHAIN_CHUNK_SIZE=400
CHUNK_SIZE=300
TOP_K=3

Processing Speed

# For faster processing
LANGCHAIN_ENABLE_CACHING=true
DEVICE=cpu  # Use 'cuda' if GPU available

Quality Settings

# For higher quality responses
LANGCHAIN_MAX_TOKENS=3000
LANGCHAIN_MODEL_TEMPERATURE=0.5
TOP_K=8

📖 User Guide

Getting Started

1. Document Upload

File Upload Method

  1. Navigate to the "📁 Add Your Documents" section
  2. Click the "📎 Upload Files" tab
  3. Drag and drop PDF files or click "Browse files"
  4. Click "🚀 Process Documents" button
  5. Watch real-time processing progress

URL Method

  1. Click the "🔗 From URL" tab
  2. Enter a direct PDF URL (e.g., https://example.com/document.pdf)
  3. Click "🚀 Process Documents"
  4. System downloads and processes automatically

2. Document Processing

The system performs these steps automatically:

  1. 📥 Processing Documents - Upload/download validation
  2. 📄 Extracting Text Content - PDF text extraction
  3. ✂️ Splitting into Sections - Intelligent chunking
  4. 🧠 Generating Embeddings - Vector creation
  5. 🔍 Building Search Index - FAISS index construction
  6. ✅ Finalizing Setup - Optimization and caching

3. Chatting with Documents

Basic Queries

- "What are the main topics discussed?"
- "Summarize the key findings"
- "List the important conclusions"

Advanced Queries

- "Compare the methodologies used in different sections"
- "Analyze the strengths and weaknesses presented"
- "Explain the relationship between X and Y concepts"

Query Types Supported

| Type | Example | Optimization |
|------|---------|--------------|
| Summary | "Summarize the main points" | Hierarchical organization |
| Analysis | "Analyze the data trends" | Critical evaluation focus |
| Comparison | "Compare approach A vs B" | Systematic contrasting |
| Explanation | "Explain how this works" | Step-by-step breakdown |
| Specific | "What is the definition of X?" | Precise information extraction |

4. Understanding Responses

Response Structure

  • Main Answer: Comprehensive response to your query
  • Source Attribution: References to specific document sections
  • Relevance Scores: Confidence indicators for retrieved content

Source References

Click "๐Ÿ“š Sources" to expand and see:

  • Source document names
  • Relevant text excerpts
  • Confidence scores
  • Page numbers (when available)

5. Managing Conversations

Chat History

  • Last 5 conversations displayed by default
  • Scroll to see complete conversation history
  • Context maintained across related questions

Reset Options

  • ๐Ÿ—‘๏ธ Clear Conversation: Remove chat history only
  • ๐Ÿ”„ Reset All Data: Clear documents and conversations

Best Practices

Document Upload Tips

  1. Optimal File Sizes: 1-50 MB per PDF for best performance
  2. Text-based PDFs: Ensure PDFs contain searchable text (not just images)
  3. Multiple Documents: Upload related documents together for cross-referencing
  4. File Naming: Use descriptive names for better source attribution

Query Optimization

  1. Be Specific: "Analyze financial performance metrics" vs "Tell me about finances"
  2. Use Context: Reference previous questions for follow-up queries
  3. Specify Format: "List the top 5..." or "Provide a brief summary..."
  4. Multiple Aspects: "Compare X and Y in terms of cost, efficiency, and scalability"

Common Use Cases

Academic Research

- "Summarize the literature review section"
- "What methodologies were used in the studies?"
- "Compare the findings across different research papers"

Business Analysis

- "What are the key financial metrics presented?"
- "Analyze the market trends discussed"
- "List the recommendations for improvement"

Technical Documentation

- "Explain the system architecture"
- "What are the installation requirements?"
- "List the troubleshooting steps for issue X"

๐Ÿ› ๏ธ Developer Guide

Project Structure

pdf-knowledge-extractor/
├── app.py                 # Main Streamlit application
├── rag_pipeline.py        # Core RAG implementation
├── indexer.py            # Vector indexing and search
├── preprocess.py         # Document processing
├── model_loader.py       # LLM API management
├── upload_handler.py     # File upload processing
├── scraper.py           # URL PDF extraction
├── config.py            # Configuration management
├── install_enhanced.py   # Setup and model installation
├── requirements.txt      # Python dependencies
├── .env                 # Environment variables
├── .gitignore           # Git exclusions
└── README.md            # Project documentation

Development Setup

1. Development Environment

# Clone repository
git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

# Create development environment
python -m venv dev-env
source dev-env/bin/activate  # or dev-env\Scripts\activate on Windows

# Install dependencies with development tools
pip install -r requirements.txt
pip install pytest black isort flake8 mypy

2. Code Quality Tools

# Format code
black . --line-length 88
isort . --profile black

# Lint code
flake8 . --max-line-length 88 --ignore E203,W503

# Type checking
mypy . --ignore-missing-imports

3. Testing

# Run tests
python -m pytest tests/ -v

# Test coverage
python -m pytest tests/ --cov=. --cov-report=html

Core Architecture

LangChain Integration

Document Processing Chain

# preprocess.py - Document loading and splitting
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyMuPDFLoader(file_path)
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

Vector Store Creation

# indexer.py - FAISS vector store
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
vector_store = FAISS.from_documents(documents, embeddings)

RAG Pipeline

# rag_pipeline.py - Complete RAG implementation
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Create retriever
retriever = vector_store.as_retriever(search_kwargs={"k": TOP_K})

# Custom prompt template
prompt_template = PromptTemplate(
    template="Context: {context}\nQuestion: {question}\nAnswer:",
    input_variables=["context", "question"]
)

# Build QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template}
)

Custom Components

Intent Analysis Engine

def analyze_query_intent(query: str) -> str:
    """Classify query type for optimized processing"""
    intent_patterns = {
        'summary': ['summarize', 'overview', 'main points'],
        'analysis': ['analyze', 'evaluate', 'assessment'],
        'comparison': ['compare', 'contrast', 'versus'],
        'explanation': ['explain', 'how', 'why', 'what is']
    }
    
    query_lower = query.lower()
    for intent, patterns in intent_patterns.items():
        if any(pattern in query_lower for pattern in patterns):
            return intent
    return 'general'
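
A quick usage check of the classifier above:

print(analyze_query_intent("Summarize the main points"))    # 'summary'
print(analyze_query_intent("Compare approach A versus B"))  # 'comparison'
print(analyze_query_intent("List the key dates"))           # 'general' (no pattern matched)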

Multi-Strategy Retrieval

def retrieve_chunks_advanced(query: str, index, strategy: str = "hybrid"):
    """Advanced retrieval with multiple strategies"""
    if strategy == "similarity":
        return similarity_search(query, index)
    elif strategy == "mmr":
        return mmr_search(query, index)
    elif strategy == "hybrid":
        # Combine multiple approaches
        sim_results = similarity_search(query, index)
        mmr_results = mmr_search(query, index)
        return merge_and_rank(sim_results, mmr_results)
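
merge_and_rank is referenced above but not shown. A minimal sketch of one way to combine the two result sets, assuming each strategy returns (chunk, score) pairs where a higher score means more relevant:

def merge_and_rank(sim_results, mmr_results, top_k=5):
    """Merge two (chunk, score) lists, de-duplicate, and keep the best score per chunk."""
    best = {}
    for chunk, score in list(sim_results) + list(mmr_results):
        # Keep the highest score seen for each chunk
        if chunk not in best or score > best[chunk]:
            best[chunk] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]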

Adding New Features

1. Custom Document Loaders

# Create new loader in preprocess.py
from typing import List

from langchain.document_loaders.base import BaseLoader
from langchain.schema import Document

class CustomDocumentLoader(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path
    
    def load(self) -> List[Document]:
        # Implement custom loading logic
        pass

2. New Retrieval Strategies

# Add to rag_pipeline.py
def create_custom_retriever(vector_store, strategy_config: Dict):
    """Create custom retrieval strategy"""
    if strategy_config["type"] == "custom":
        return CustomRetriever(
            vector_store=vector_store,
            **strategy_config["params"]
        )
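
CustomRetriever above is a placeholder. A minimal sketch of what such a class could look like without depending on LangChain internals — a thin wrapper around the vector store's similarity search (the constructor parameters are illustrative):

class CustomRetriever:
    """Illustrative retriever: a thin wrapper around a LangChain FAISS vector store."""

    def __init__(self, vector_store, k: int = 5):
        self.vector_store = vector_store
        self.k = k

    def get_relevant_documents(self, query: str):
        # similarity_search is part of the LangChain vector store interface
        return self.vector_store.similarity_search(query, k=self.k)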

3. Enhanced UI Components

# Add to app.py
def display_custom_component():
    """Custom Streamlit component"""
    with st.container():
        col1, col2 = st.columns(2)
        with col1:
            # Custom visualization
            pass
        with col2:
            # Interactive controls
            pass

Performance Optimization

Caching Strategies

# Streamlit caching
@st.cache_data
def load_and_process_documents():
    """Cache document processing results"""
    return process_documents()

@st.cache_resource
def initialize_models():
    """Cache model initialization"""
    return load_embedding_model()

Async Processing

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def process_documents_async(documents: List[str]):
    """Async document processing"""
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor() as executor:
        tasks = [
            loop.run_in_executor(executor, process_single_doc, doc)
            for doc in documents
        ]
        return await asyncio.gather(*tasks)

Error Handling

Graceful Degradation

import logging

logger = logging.getLogger(__name__)

def robust_processing_pipeline(documents):
    """Processing with fallback mechanisms"""
    try:
        # Try advanced LangChain processing
        return langchain_processing(documents)
    except ImportError:
        # Fallback to basic processing
        return basic_processing(documents)
    except Exception as e:
        # Log error and use minimal processing
        logger.error(f"Processing error: {e}")
        return minimal_processing(documents)

User-Friendly Error Messages

from functools import wraps

def handle_api_errors(func):
    """Decorator for API error handling"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ConnectionError:
            return "🌐 Connection error. Please check your internet connection."
        except AuthenticationError:  # AuthenticationError / RateLimitError come from the API client in use
            return "🔑 API key invalid. Please check your configuration."
        except RateLimitError:
            return "⏰ Rate limit exceeded. Please wait and try again."
    return wrapper
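
Applied to any function that talks to the API; ask_model below is a hypothetical wrapper around LLMManager.generate_response:

@handle_api_errors
def ask_model(prompt: str) -> str:
    return llm.generate_response(prompt, max_tokens=500)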

🚀 Deployment Guide

Local Deployment

Standard Setup

# Production environment
python -m venv prod-env
source prod-env/bin/activate
pip install -r requirements.txt

# Configure production settings
cp .env.example .env
# Edit .env with production values

# Run application
streamlit run app.py --server.port 8501 --server.address 0.0.0.0

Docker Deployment

Dockerfile

FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create data directory
RUN mkdir -p data/pdfs data/texts

# Expose port
EXPOSE 8501

# Health check
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Run application
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Docker Compose

version: '3.8'
services:
  rag-pdf-chat:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
    volumes:
      - ./data:/app/data
    restart: unless-stopped

Docker Commands

# Build image
docker build -t pdf-knowledge-extractor .

# Run container
docker run -p 8501:8501 -e OPENROUTER_API_KEY=your_key pdf-knowledge-extractor

# Using docker-compose
docker-compose up -d

Cloud Deployment

Streamlit Cloud

  1. Connect Repository

    • Link GitHub repository to Streamlit Cloud
    • Configure automatic deployments
  2. Environment Variables

    # Add in Streamlit Cloud dashboard
    OPENROUTER_API_KEY = "your_key_here"
    STREAMLIT_SERVER_MAX_UPLOAD_SIZE = 200
  3. Requirements

    # Ensure requirements.txt includes all dependencies
    streamlit>=1.28.0
    langchain>=0.1.0
    # ... other dependencies

AWS EC2 Deployment

Instance Setup

# Launch EC2 instance (Ubuntu 20.04 LTS)
# Connect via SSH

# Update system
sudo apt update && sudo apt upgrade -y

# Install Python and dependencies
sudo apt install python3-pip python3-venv git -y

# Clone repository
git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

# Setup application
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp .env.example .env
nano .env  # Add your API key

# Install and configure nginx
sudo apt install nginx -y

Nginx Configuration

# /etc/nginx/sites-available/rag-pdf-chat
server {
    listen 80;
    server_name your-domain.com;
    
    location / {
        proxy_pass http://127.0.0.1:8501;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Systemd Service

# /etc/systemd/system/rag-pdf-chat.service
[Unit]
Description=Enhanced RAG PDF Chat
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/pdf-knowledge-extractor
Environment=PATH=/home/ubuntu/pdf-knowledge-extractor/venv/bin
ExecStart=/home/ubuntu/pdf-knowledge-extractor/venv/bin/streamlit run app.py --server.port 8501
Restart=always

[Install]
WantedBy=multi-user.target

Service Management

# Enable and start service
sudo systemctl enable rag-pdf-chat
sudo systemctl start rag-pdf-chat

# Check status
sudo systemctl status rag-pdf-chat

# View logs
sudo journalctl -u rag-pdf-chat -f

Heroku Deployment

Procfile

web: streamlit run app.py --server.port=$PORT --server.address=0.0.0.0

Heroku Setup

# Install Heroku CLI and login
heroku login

# Create app
heroku create your-app-name

# Set environment variables
heroku config:set OPENROUTER_API_KEY=your_key

# Deploy
git push heroku main

Production Considerations

Security

  1. Environment Variables

    # Never commit .env files
    # Use secure environment variable management
    export OPENROUTER_API_KEY="secure_key_here"
  2. HTTPS Configuration

    # Use SSL certificates
    sudo certbot --nginx -d your-domain.com
  3. Access Control

    # Add authentication if needed
    import streamlit_authenticator as stauth

Performance Optimization

  1. Resource Limits

    # config.py - Production settings
    MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50MB
    MAX_CONCURRENT_USERS = 10
    CACHE_TTL = 3600  # 1 hour
  2. Monitoring

    # Add monitoring and logging
    import logging
    
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

Scaling Considerations

  1. Horizontal Scaling

    • Use load balancers for multiple instances
    • Implement session state management
    • Consider Redis for shared caching (see the sketch after this list)
  2. Vertical Scaling

    • Optimize memory usage
    • Use GPU instances for large models
    • Implement efficient caching strategies
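
For the Redis option mentioned above, a minimal sketch of a shared response cache, assuming a Redis instance is reachable at localhost:6379 and the redis Python package is installed (the key scheme and TTL are illustrative):

import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)
CACHE_TTL = 3600  # 1 hour, matching the documented production setting

def cached_answer(query: str, compute_answer) -> str:
    """Return a cached response if any instance has answered this query recently."""
    key = "rag:answer:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = compute_answer(query)
    cache.setex(key, CACHE_TTL, answer)
    return answer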

โ“ Troubleshooting

Common Issues

Installation Problems

Issue: pip install fails

# Solution 1: Upgrade pip
python -m pip install --upgrade pip

# Solution 2: Use specific index
pip install -r requirements.txt -i https://pypi.org/simple/

# Solution 3: Install specific problematic packages
pip install torch --index-url https://download.pytorch.org/whl/cpu

Issue: LangChain import errors

# Check LangChain version
pip show langchain

# Reinstall with specific version
pip uninstall langchain langchain-community
pip install langchain==0.1.0 langchain-community==0.0.10

Issue: FAISS installation fails

# Windows specific
pip install faiss-cpu --no-cache-dir

# macOS specific  
conda install -c conda-forge faiss-cpu

# Linux specific
pip install faiss-cpu==1.7.4

Runtime Errors

Issue: "OpenRouter API key not found"

# Check environment loading
from dotenv import load_dotenv
load_dotenv()
print(os.getenv("OPENROUTER_API_KEY"))

# Verify .env file format
OPENROUTER_API_KEY=your_key_here  # No quotes, no spaces

Issue: "No module named 'sentence_transformers'"

# Install specific version
pip install sentence-transformers==2.2.2

# Clear cache and reinstall
pip cache purge
pip install sentence-transformers --no-cache-dir

Issue: "FAISS index not found"

# Check data directory structure
import os
print(os.listdir("data/"))

# Reset and rebuild index
python -c "
from indexer import get_index_info
print(get_index_info())
"

Performance Issues

Issue: Slow document processing

# Reduce chunk size for faster processing
# In .env file:
LANGCHAIN_CHUNK_SIZE=400
CHUNK_SIZE=300
TOP_K=3

# Or modify config.py temporarily:
CHUNK_SIZE = 300
CHUNK_OVERLAP = 30

Issue: High memory usage

# Monitor memory usage
import psutil
print(f"Memory usage: {psutil.virtual_memory().percent}%")

# Optimize settings in config.py:
DEVICE = "cpu"  # Ensure CPU-only processing
LANGCHAIN_ENABLE_CACHING = False  # Disable caching if needed

Issue: Slow query responses

# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s "https://openrouter.ai/api/v1/models"

# Reduce context size
TOP_K=3  # Fewer retrieved chunks
LANGCHAIN_MAX_TOKENS=1000  # Shorter responses

UI/UX Issues

Issue: Streamlit app won't start

# Check port availability
netstat -an | grep 8501

# Try different port
streamlit run app.py --server.port 8502

# Clear Streamlit cache
streamlit cache clear

Issue: File upload not working

# Check file size limits
# In .streamlit/config.toml:
[server]
maxUploadSize = 200

# Verify file permissions
ls -la data/pdfs/
chmod 755 data/

Issue: Chat history not persisting

# Check session state
import streamlit as st
print(st.session_state)

# Clear and reinitialize
if st.button("Reset Session"):
    for key in st.session_state.keys():
        del st.session_state[key]
    st.rerun()

API and Connectivity Issues

Issue: OpenRouter API errors

# Test API connection
from model_loader import test_openrouter_connection
success, message = test_openrouter_connection()
print(f"API Status: {success}, Message: {message}")

# Check API key validity
import requests
headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}"}
response = requests.get("https://openrouter.ai/api/v1/models", headers=headers)
print(f"Status: {response.status_code}")

Issue: Rate limiting

# Implement retry logic
import time
from functools import wraps

def retry_api_call(max_retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                        time.sleep(delay * (2 ** attempt))  # Exponential backoff
                        continue
                    raise e
            return None
        return wrapper
    return decorator

Debugging Tools

Enable Debug Mode

# Add to config.py
DEBUG = True
LANGCHAIN_VERBOSE = True
LANGCHAIN_DEBUG = True

# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)

Memory Profiling

# Install memory profiler
pip install memory-profiler

# Profile specific functions
from memory_profiler import profile

@profile
def process_large_document():
    # Your processing code
    pass

Performance Monitoring

# Add timing decorators
import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper

@timer
def slow_function():
    # Function to profile
    pass

Network Diagnostics

# Test connectivity
ping openrouter.ai
nslookup openrouter.ai

# Check DNS resolution
dig openrouter.ai

# Test HTTPS connectivity
curl -I https://openrouter.ai/api/v1/models

Error Recovery

Automatic Recovery Mechanisms

# Implement graceful fallbacks
def robust_rag_pipeline(query, fallback_enabled=True):
    try:
        return advanced_rag_pipeline(query)
    except Exception as e:
        if fallback_enabled:
            print(f"Advanced pipeline failed: {e}")
            return simple_rag_pipeline(query)
        raise e

def simple_rag_pipeline(query):
    """Simplified RAG without advanced features"""
    # Basic similarity search without LangChain
    pass

Data Recovery

# Backup and restore functions
import os
import shutil
from datetime import datetime

def backup_index():
    """Backup FAISS index and metadata"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = f"backups/backup_{timestamp}"
    os.makedirs(backup_dir, exist_ok=True)
    
    shutil.copytree("data/", f"{backup_dir}/data/")
    print(f"Backup created: {backup_dir}")

def restore_index(backup_path):
    """Restore from backup"""
    if os.path.exists(backup_path):
        shutil.copytree(f"{backup_path}/data/", "data/", dirs_exist_ok=True)
        print("Restore completed")

๐Ÿค Contributing

Getting Started

Development Environment Setup

# Fork the repository on GitHub
# Clone your fork
git clone https://github.com/<your-username>/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor

# Add upstream remote
git remote add upstream https://github.com/saktisriraj/pdf-knowledge-extractor.git

# Create development environment
python -m venv dev-env
source dev-env/bin/activate  # Windows: dev-env\Scripts\activate

# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # Development tools

Development Dependencies

Create requirements-dev.txt:

# Testing
pytest>=7.0.0
pytest-cov>=4.0.0
pytest-mock>=3.10.0

# Code Quality
black>=23.0.0
isort>=5.12.0
flake8>=6.0.0
mypy>=1.0.0

# Documentation
sphinx>=6.0.0
sphinx-rtd-theme>=1.2.0

# Development Tools
pre-commit>=3.0.0
jupyter>=1.0.0
ipython>=8.0.0

Code Standards

Python Code Style

# Use Black formatter with 88 character line length
black . --line-length 88

# Import sorting with isort
isort . --profile black

# Type hints for all functions
def process_document(file_path: str, chunk_size: int = 500) -> List[str]:
    """Process document and return chunks.
    
    Args:
        file_path: Path to the document file
        chunk_size: Size of text chunks
        
    Returns:
        List of text chunks
        
    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If chunk_size is invalid
    """
    pass

Documentation Standards

class DocumentProcessor:
    """Advanced document processing with LangChain integration.
    
    This class handles document loading, text extraction, and chunking
    using various LangChain strategies for optimal performance.
    
    Attributes:
        strategy: Text splitting strategy ('recursive', 'semantic', etc.)
        chunk_size: Maximum size of text chunks
        
    Example:
        >>> processor = DocumentProcessor(strategy='hybrid')
        >>> chunks = processor.process_pdf('document.pdf')
        >>> print(f"Created {len(chunks)} chunks")
    """
    
    def __init__(self, strategy: str = 'hybrid', chunk_size: int = 500):
        """Initialize processor with specified strategy."""
        pass

Commit Message Format

# Format: <type>(<scope>): <description>
# Types: feat, fix, docs, style, refactor, test, chore

# Examples:
git commit -m "feat(rag): add multi-query retrieval strategy"
git commit -m "fix(ui): resolve file upload progress bar issue"
git commit -m "docs(api): update RAG pipeline documentation"
git commit -m "refactor(indexer): optimize FAISS index creation"

Contribution Types

๐Ÿ› Bug Reports

Use this template for bug reports:

## Bug Report

**Describe the bug**
A clear description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
What you expected to happen.

**Screenshots**
If applicable, add screenshots.

**Environment:**
- OS: [e.g. Windows 10, macOS 12.0, Ubuntu 20.04]
- Python Version: [e.g. 3.9.7]
- Dependencies: [paste requirements.txt versions]

**Additional context**
Any other context about the problem.

✨ Feature Requests

## Feature Request

**Is your feature request related to a problem?**
A clear description of what the problem is.

**Describe the solution you'd like**
A clear description of what you want to happen.

**Describe alternatives you've considered**
Alternative solutions or features you've considered.

**Additional context**
Add any other context, mockups, or examples.

**Implementation ideas**
If you have ideas about how to implement this feature.

🚀 Pull Requests

PR Template

## Pull Request

**Description**
Brief description of changes made.

**Type of change**
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update

**Testing**
- [ ] Tests pass locally
- [ ] Added tests for new functionality
- [ ] Manual testing completed

**Checklist**
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Code is commented appropriately
- [ ] Documentation updated
- [ ] No new warnings introduced

PR Guidelines

  1. Branch Naming

    # Feature branches
    git checkout -b feature/add-multi-modal-support
    
    # Bug fix branches  
    git checkout -b fix/resolve-memory-leak
    
    # Documentation branches
    git checkout -b docs/update-api-reference
  2. Development Workflow

    # Update main branch
    git checkout main
    git pull upstream main
    
    # Create feature branch
    git checkout -b feature/your-feature-name
    
    # Make changes and commit
    git add .
    git commit -m "feat(component): add new functionality"
    
    # Push to your fork
    git push origin feature/your-feature-name
    
    # Create pull request on GitHub
  3. Code Review Process

    • All PRs require at least one review
    • Address reviewer comments promptly
    • Update documentation for new features
    • Ensure tests pass before requesting review

Testing Guidelines

Unit Tests

# tests/test_rag_pipeline.py
import pytest
from unittest.mock import Mock, patch
from rag_pipeline import EnhancedLangChainRAG

class TestEnhancedLangChainRAG:
    def setup_method(self):
        """Setup test fixtures."""
        self.rag = EnhancedLangChainRAG()
    
    def test_analyze_query_intent_summary(self):
        """Test query intent analysis for summary queries."""
        query = "Summarize the main points of this document"
        intent = self.rag.analyze_query_intent(query)
        assert intent == "summary"
    
    @patch('rag_pipeline.generate_response')
    def test_run_rag_pipeline_success(self, mock_generate):
        """Test successful RAG pipeline execution."""
        mock_generate.return_value = "Test response"
        
        # Mock dependencies
        mock_index = Mock()
        chunks = ["chunk1", "chunk2"]
        metadata = [{"source": "doc1"}, {"source": "doc2"}]
        
        response, chunk_info = self.rag.run_rag_pipeline(
            "test query", mock_index, chunks, metadata
        )
        
        assert response == "Test response"
        assert len(chunk_info) > 0

Integration Tests

# tests/test_integration.py
import tempfile
import os
from app import main
from indexer import build_faiss_index
from preprocess import load_and_split_texts

class TestIntegration:
    def test_end_to_end_pipeline(self):
        """Test complete document processing pipeline."""
        with tempfile.TemporaryDirectory() as temp_dir:
            # Create test PDF
            test_pdf_path = os.path.join(temp_dir, "test.pdf")
            self.create_test_pdf(test_pdf_path)
            
            # Process document
            chunks, metadata = load_and_split_texts()
            assert len(chunks) > 0
            
            # Build index
            index, processed_chunks = build_faiss_index(chunks, metadata)
            assert index is not None
            
            # Test query
            from rag_pipeline import run_rag_pipeline
            response, sources = run_rag_pipeline(
                "What is this document about?", 
                index, chunks, metadata
            )
            assert len(response) > 0

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_rag_pipeline.py -v

# Run with coverage
python -m pytest tests/ --cov=. --cov-report=html

# Run integration tests only
python -m pytest tests/test_integration.py -v

Documentation Contributions

API Documentation

# Use Google-style docstrings
def retrieve_chunks_advanced(
    query: str, 
    index: Any, 
    chunks: List[str], 
    metadata: Optional[List[Dict]] = None,
    top_k: int = 5
) -> Tuple[List[str], List[Dict]]:
    """Retrieve document chunks using advanced strategies.
    
    Performs similarity search with multiple retrieval strategies
    including semantic search, MMR, and hybrid approaches.
    
    Args:
        query: User query string for semantic search
        index: FAISS index for similarity computation
        chunks: List of document text chunks
        metadata: Optional metadata for each chunk
        top_k: Number of top chunks to retrieve
        
    Returns:
        Tuple containing:
            - List of retrieved text chunks
            - List of chunk metadata with relevance scores
            
    Raises:
        ValueError: If query is empty or index is invalid
        IndexError: If top_k exceeds available chunks
        
    Example:
        >>> chunks, info = retrieve_chunks_advanced(
        ...     "What is machine learning?",
        ...     faiss_index,
        ...     document_chunks,
        ...     chunk_metadata,
        ...     top_k=5
        ... )
        >>> print(f"Retrieved {len(chunks)} relevant chunks")
    """

README Updates

When adding features, update the README:

  1. Feature list - Add new capabilities
  2. Installation - Update if new dependencies
  3. Usage examples - Show new functionality
  4. Configuration - Document new settings

Wiki Documentation

Maintain these wiki pages:

  • API Reference - Complete function documentation
  • Architecture Guide - System design and components
  • Deployment Guide - Production deployment instructions
  • Troubleshooting - Common issues and solutions
  • Contributing Guide - This document
  • Changelog - Version history and breaking changes

📄 Changelog

Version 1.0.0 (2024-01-15)

🚀 Initial Release

Core Features

  • Enhanced RAG Pipeline: Complete LangChain integration with advanced retrieval strategies
  • Multi-Strategy Text Processing: Recursive, token-based, semantic, and hybrid chunking
  • FAISS Vector Search: Optimized similarity search with contextual compression
  • Intent-Based Query Analysis: Automatic query classification for optimized responses
  • Real-Time UI: Animated Streamlit interface with progress tracking

Architecture Components

  • Document Processing: Advanced PDF loading with PyMuPDF and LangChain loaders
  • Vector Indexing: High-performance FAISS indexing with sentence-transformers
  • API Integration: OpenRouter API management with DeepSeek R1 model
  • Configuration Management: Comprehensive environment variable system
  • Error Handling: Graceful fallbacks and comprehensive error recovery

User Interface

  • Gradient Styling: Modern CSS with animated elements
  • Progress Tracking: Real-time processing feedback with step indicators
  • Source Attribution: Expandable source references with relevance scoring
  • Conversation Memory: Context-aware chat with history management

Technical Highlights

  • Performance Optimization: Parallel processing and intelligent caching
  • Cross-Platform: Windows, macOS, and Linux compatibility
  • Modular Design: Clean separation of concerns for easy extension
  • Production Ready: Comprehensive error handling and monitoring

🔧 Configuration

  • Environment variable management with .env support
  • Configurable chunk sizes and overlap parameters
  • Adjustable retrieval strategies and model parameters
  • Customizable UI themes and animations

📚 Documentation

  • Complete API reference with examples
  • Architecture diagrams and component documentation
  • Deployment guides for local and cloud environments
  • Comprehensive troubleshooting and FAQ sections

🧪 Testing

  • Unit tests for core components
  • Integration tests for end-to-end workflows
  • Performance benchmarks and optimization guidelines
  • Error handling validation

Version History Template

Version X.Y.Z (YYYY-MM-DD)

✨ New Features

  • Feature description with technical details
  • User-facing improvements and capabilities

๐Ÿ› Bug Fixes

  • Issue resolution with root cause analysis
  • Performance improvements and optimizations

🔧 Technical Changes

  • Architecture updates and refactoring
  • Dependency updates and compatibility improvements

📖 Documentation

  • New documentation sections
  • Updated examples and tutorials

โš ๏ธ Breaking Changes

  • API changes that require user action
  • Configuration changes and migration guides

๐Ÿ—‘๏ธ Deprecated

  • Features marked for removal in future versions
  • Migration paths for deprecated functionality

Future Roadmap

Version 1.1.0 (Planned)

  • Multi-Modal Support: Image and table extraction from PDFs
  • Advanced Analytics: Document insights and visualization
  • Batch Processing: Multiple document upload optimization
  • Export Features: Conversation and insight export functionality

Version 1.2.0 (Planned)

  • API Endpoints: RESTful API for programmatic access
  • Authentication: User management and access control
  • Cloud Storage: Integration with AWS S3, Google Drive
  • Advanced Search: Full-text search and filtering capabilities

Version 2.0.0 (Future)

  • Multi-Language Support: International language processing
  • Custom Models: Support for local LLM deployment
  • Enterprise Features: Advanced security and compliance
  • Mobile App: Native mobile application development

This comprehensive documentation provides everything needed to understand, use, develop, and deploy the Enhanced RAG PDF Chat application. Each section is designed to be self-contained while linking to related concepts throughout the documentation.

For the latest updates and additional resources, visit the [GitHub repository](https://github.com/saktisriraj/pdf-knowledge-extractor) and [project wiki](https://github.com/saktisriraj/pdf-knowledge-extractor/wiki).

โš ๏ธ **GitHub.com Fallback** โš ๏ธ