Enhanced RAG PDF Chat – Documentation - SaktiSriraj/pdf-knowledge-extractor GitHub Wiki
- [Getting Started](#-getting-started)
- [Architecture Overview](#-architecture-overview)
- [API Reference](#-api-reference)
- [Configuration Guide](#-configuration-guide)
- [User Guide](#-user-guide)
- [Developer Guide](#-developer-guide)
- [Deployment Guide](#-deployment-guide)
- [Troubleshooting](#-troubleshooting)
- [Contributing](#-contributing)
- [Changelog](#-changelog)
- Python 3.8+ (Recommended: Python 3.9 or 3.10)
- OpenRouter API Key (Free at [openrouter.ai](https://openrouter.ai))
- 4GB+ RAM for optimal performance
- Internet connection for initial model downloads
git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor
# Create virtual environment
python -m venv venv
# Activate environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Create environment file
echo "OPENROUTER_API_KEY=your_key_here" > .env
# Download and setup models (optional - auto-downloads on first use)
python install_enhanced.py
# Launch application
streamlit run app.py
Open your browser and navigate to: http://localhost:8501
- Python 3.8+ installed
- Virtual environment activated
- Dependencies installed successfully
- API key configured in `.env` file
- Application launches without errors
- Can access web interface
graph TB
A[Document Input] --> B[LangChain Processor]
B --> C[Advanced Text Splitting]
C --> D[Sentence Transformers]
D --> E[FAISS Vector Index]
F[User Query] --> G[Intent Analysis]
G --> H[Multi-Strategy Retrieval]
H --> E
E --> I[Context Assembly]
I --> J[DeepSeek R1 API]
J --> K[Enhanced Response]
style A fill:#e1f5fe,color:#000000
style E fill:#f3e5f5,color:#000000
style J fill:#e8f5e8,color:#000000
style K fill:#fff3e0,color:#000000
| Component | File | Purpose | Technology |
|---|---|---|---|
| Document Loader | preprocess.py | PDF/Text extraction | LangChain + PyMuPDF |
| Text Splitter | preprocess.py | Intelligent chunking | RecursiveCharacterTextSplitter |
| Embedding Engine | indexer.py | Vector generation | Sentence Transformers |
| Vector Store | indexer.py | Similarity search | FAISS |
| Component | File | Purpose | Technology |
|---|---|---|---|
| Intent Analyzer | rag_pipeline.py | Query classification | Custom NLP |
| Retriever | rag_pipeline.py | Context extraction | Multi-strategy search |
| Prompt Engine | rag_pipeline.py | Dynamic prompting | LangChain Templates |
| LLM Interface | model_loader.py | Response generation | OpenRouter API |
| Component | File | Purpose | Technology |
|---|---|---|---|
| Main App | app.py | Web interface | Streamlit |
| Upload Handler | upload_handler.py | File processing | PyMuPDF |
| URL Scraper | scraper.py | Web PDF extraction | Requests + PyMuPDF |
| Configuration | config.py | Settings management | Environment Variables |
PDF/URL → Text Extraction → Chunking → Embedding → Vector Storage
User Query → Intent Analysis → Retrieval → Context Assembly → LLM → Response
Raw Response → Source Attribution → Formatting → UI Display
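Taken together, these flows correspond to the module-level helpers documented below. A minimal end-to-end sketch (assuming documents have already been extracted into data/texts/):

from preprocess import load_and_split_texts
from indexer import build_faiss_index
from rag_pipeline import run_rag_pipeline

# Index once...
chunks, metadata = load_and_split_texts(strategy="hybrid")
index, processed_chunks = build_faiss_index(chunks, metadata)

# ...then answer queries against the index
response, sources = run_rag_pipeline(
    "What are the main findings?", index, processed_chunks, metadata
)
print(response)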
Main RAG pipeline orchestrator.
from rag_pipeline import EnhancedLangChainRAG
rag = EnhancedLangChainRAG()
Initialize sentence transformer model for embeddings.
Returns: `SentenceTransformer` instance
Example:
model = rag.initialize_embedding_model()
Analyze user query to determine optimal response strategy.
Parameters:
- `query` (str): User input query
Returns: Intent type ('summary', 'analysis', 'comparison', 'explanation', 'general')
Example:
intent = rag.analyze_query_intent("Summarize the main points")
# Returns: 'summary'
Advanced document indexing with FAISS integration.
from indexer import OptimizedLangChainIndexer
indexer = OptimizedLangChainIndexer()
Create optimized FAISS vector store.
Parameters:
- `chunks` (List[str]): Text chunks to index
- `metadata` (List[Dict]): Chunk metadata

Returns: `(index, chunks)` tuple
Example:
index, processed_chunks = indexer.create_vectorstore_fast(
chunks=["chunk1", "chunk2"],
metadata=[{"source": "doc1"}, {"source": "doc2"}]
)
OpenRouter API management and response generation.
from model_loader import LLMManager
llm = LLMManager()
Generate AI response using configured model.
Parameters:
- `prompt` (str): Input prompt
- `max_tokens` (int): Maximum response length
- `temperature` (float): Response creativity (0.0-1.0)
Returns: Generated response text
Example:
response = llm.generate_response(
prompt="Explain quantum computing",
max_tokens=500,
temperature=0.7
)
Load and process documents with advanced splitting.
Parameters:
- `strategy` (str): Splitting strategy ('recursive', 'token_based', 'semantic', 'hybrid')

Returns: `(chunks, metadata)` tuple
Example:
from preprocess import load_and_split_texts
chunks, metadata = load_and_split_texts(strategy='hybrid')
Build optimized FAISS search index.
Example:
from indexer import build_faiss_index
index, processed_chunks = build_faiss_index(chunks, metadata)
run_rag_pipeline(query: str, index, chunks: List[str], metadata: List[Dict], max_tokens: int = 2000) -> tuple
Execute complete RAG pipeline for query processing.
Parameters:
- `query` (str): User question
- `index`: FAISS index
- `chunks` (List[str]): Document chunks
- `metadata` (List[Dict]): Chunk metadata
- `max_tokens` (int): Response length limit

Returns: `(response, chunk_info)` tuple
Example:
from rag_pipeline import run_rag_pipeline
response, sources = run_rag_pipeline(
query="What are the main findings?",
index=faiss_index,
chunks=document_chunks,
metadata=chunk_metadata
)
Get comprehensive document processing statistics.
Example:
from preprocess import get_processing_stats
stats = get_processing_stats()
print(f"Processed {stats['files']} files with {stats['total_words']} words")
Get FAISS index information and status.
Example:
from indexer import get_index_info
info = get_index_info()
print(f"Index contains {info['chunk_count']} chunks")
Create a `.env` file in your project root:
# OpenRouter API Configuration (Required)
OPENROUTER_API_KEY=your_api_key_here
OPENROUTER_MODEL=deepseek/deepseek-r1-0528:free
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# Application Settings
SITE_URL=http://localhost:8501
SITE_NAME=Enhanced RAG PDF Chat
# LangChain Text Processing
LANGCHAIN_CHUNK_SIZE=600
LANGCHAIN_CHUNK_OVERLAP=100
LANGCHAIN_MODEL_TEMPERATURE=0.7
LANGCHAIN_MAX_TOKENS=2000
LANGCHAIN_ENABLE_CACHING=true
LANGCHAIN_VERBOSE=false
# Document Processing
CHUNK_SIZE=500
CHUNK_OVERLAP=50
TOP_K=5
# Performance Settings
DEVICE=cpu
# Base directories
BASE_DIR = Path(__file__).parent
DATA_DIR = BASE_DIR / "data"
# Data paths
PDF_DIR = str(DATA_DIR / "pdfs")
TEXT_DIR = str(DATA_DIR / "texts")
FAISS_INDEX_PATH = str(DATA_DIR / "faiss_index")
METADATA_PATH = str(DATA_DIR / "metadata.json")
# Model settings
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
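The environment variables above feed these settings. A sketch of how `config.py` might resolve them with python-dotenv (the exact loading code is an assumption, not copied from the source):

import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "50"))
TOP_K = int(os.getenv("TOP_K", "5"))
DEVICE = os.getenv("DEVICE", "cpu")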
pdf-knowledge-extractor/
├── app.py               # Main application
├── config.py            # Configuration management
├── data/                # Data storage
│   ├── pdfs/            # Original PDF files
│   ├── texts/           # Extracted text files
│   ├── faiss_index*     # Vector index files
│   └── metadata.json    # Processing metadata
├── .env                 # Environment variables
└── requirements.txt     # Dependencies
# For systems with limited RAM
LANGCHAIN_CHUNK_SIZE=400
CHUNK_SIZE=300
TOP_K=3
# For faster processing
LANGCHAIN_ENABLE_CACHING=true
DEVICE=cpu # Use 'cuda' if GPU available
# For higher quality responses
LANGCHAIN_MAX_TOKENS=3000
LANGCHAIN_MODEL_TEMPERATURE=0.5
TOP_K=8
- Navigate to the "Add Your Documents" section
- Click the "Upload Files" tab
- Drag and drop PDF files or click "Browse files"
- Click the "Process Documents" button
- Watch real-time processing progress
- Click the "From URL" tab
- Enter a direct PDF URL (e.g., https://example.com/document.pdf)
- Click "Process Documents"
- System downloads and processes automatically
The system performs these steps automatically:
- Processing Documents - Upload/download validation
- Extracting Text Content - PDF text extraction
- Splitting into Sections - Intelligent chunking
- Generating Embeddings - Vector creation
- Building Search Index - FAISS index construction
- Finalizing Setup - Optimization and caching
- "What are the main topics discussed?"
- "Summarize the key findings"
- "List the important conclusions"
- "Compare the methodologies used in different sections"
- "Analyze the strengths and weaknesses presented"
- "Explain the relationship between X and Y concepts"
| Type | Example | Optimization |
|---|---|---|
| Summary | "Summarize the main points" | Hierarchical organization |
| Analysis | "Analyze the data trends" | Critical evaluation focus |
| Comparison | "Compare approach A vs B" | Systematic contrasting |
| Explanation | "Explain how this works" | Step-by-step breakdown |
| Specific | "What is the definition of X?" | Precise information extraction |
- Main Answer: Comprehensive response to your query
- Source Attribution: References to specific document sections
- Relevance Scores: Confidence indicators for retrieved content
Click "๐ Sources" to expand and see:
- Source document names
- Relevant text excerpts
- Confidence scores
- Page numbers (when available)
- Last 5 conversations displayed by default
- Scroll to see complete conversation history
- Context maintained across related questions
- Clear Conversation: Remove chat history only
- Reset All Data: Clear documents and conversations
- Optimal File Sizes: 1-50 MB per PDF for best performance
- Text-based PDFs: Ensure PDFs contain searchable text (not just images)
- Multiple Documents: Upload related documents together for cross-referencing
- File Naming: Use descriptive names for better source attribution
- Be Specific: "Analyze financial performance metrics" vs "Tell me about finances"
- Use Context: Reference previous questions for follow-up queries
- Specify Format: "List the top 5..." or "Provide a brief summary..."
- Multiple Aspects: "Compare X and Y in terms of cost, efficiency, and scalability"
- "Summarize the literature review section"
- "What methodologies were used in the studies?"
- "Compare the findings across different research papers"
- "What are the key financial metrics presented?"
- "Analyze the market trends discussed"
- "List the recommendations for improvement"
- "Explain the system architecture"
- "What are the installation requirements?"
- "List the troubleshooting steps for issue X"
pdf-knowledge-extractor/
├── app.py               # Main Streamlit application
├── rag_pipeline.py      # Core RAG implementation
├── indexer.py           # Vector indexing and search
├── preprocess.py        # Document processing
├── model_loader.py      # LLM API management
├── upload_handler.py    # File upload processing
├── scraper.py           # URL PDF extraction
├── config.py            # Configuration management
├── install_enhanced.py  # Setup and model installation
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables
├── .gitignore           # Git exclusions
└── README.md            # Project documentation
# Clone repository
git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor
# Create development environment
python -m venv dev-env
source dev-env/bin/activate # or dev-env\Scripts\activate on Windows
# Install dependencies with development tools
pip install -r requirements.txt
pip install pytest black isort flake8 mypy
# Format code
black . --line-length 88
isort . --profile black
# Lint code
flake8 . --max-line-length 88 --ignore E203,W503
# Type checking
mypy . --ignore-missing-imports
# Run tests
python -m pytest tests/ -v
# Test coverage
python -m pytest tests/ --cov=. --cov-report=html
# preprocess.py - Document loading and splitting
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyMuPDFLoader(file_path)
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=CHUNK_OVERLAP,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
# indexer.py - FAISS vector store
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
vector_store = FAISS.from_documents(documents, embeddings)
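For persistence between runs, the LangChain FAISS wrapper can save and reload the index. A sketch (FAISS_INDEX_PATH comes from config.py; note that newer LangChain releases require an explicit allow_dangerous_deserialization flag on load):

# Persist the index to disk
vector_store.save_local(FAISS_INDEX_PATH)

# Reload later instead of re-embedding every document
vector_store = FAISS.load_local(FAISS_INDEX_PATH, embeddings)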
# rag_pipeline.py - Complete RAG implementation
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Create retriever
retriever = vector_store.as_retriever(search_kwargs={"k": TOP_K})
# Custom prompt template
prompt_template = PromptTemplate(
template="Context: {context}\nQuestion: {question}\nAnswer:",
input_variables=["context", "question"]
)
# Build QA chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": prompt_template}
)
def analyze_query_intent(query: str) -> str:
"""Classify query type for optimized processing"""
intent_patterns = {
'summary': ['summarize', 'overview', 'main points'],
'analysis': ['analyze', 'evaluate', 'assessment'],
'comparison': ['compare', 'contrast', 'versus'],
'explanation': ['explain', 'how', 'why', 'what is']
}
query_lower = query.lower()
for intent, patterns in intent_patterns.items():
if any(pattern in query_lower for pattern in patterns):
return intent
return 'general'
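A few example classifications (note that patterns match anywhere in the query, and the first matching intent in dictionary order wins):

print(analyze_query_intent("Summarize the main points"))    # 'summary'
print(analyze_query_intent("Compare approach A versus B"))  # 'comparison'
print(analyze_query_intent("When was it published?"))       # 'general'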
def retrieve_chunks_advanced(query: str, index, strategy: str = "hybrid"):
"""Advanced retrieval with multiple strategies"""
if strategy == "similarity":
return similarity_search(query, index)
elif strategy == "mmr":
return mmr_search(query, index)
elif strategy == "hybrid":
# Combine multiple approaches
sim_results = similarity_search(query, index)
mmr_results = mmr_search(query, index)
return merge_and_rank(sim_results, mmr_results)
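merge_and_rank is not defined above; one reasonable sketch (an assumption about the intended behavior: deduplicate across strategies and keep each chunk's best rank):

def merge_and_rank(sim_results, mmr_results):
    """Merge two ranked chunk lists, keeping each chunk's best rank."""
    best_rank = {}
    for results in (sim_results, mmr_results):
        for rank, chunk in enumerate(results):
            if chunk not in best_rank or rank < best_rank[chunk]:
                best_rank[chunk] = rank
    # Chunks that ranked highly under either strategy come first
    return sorted(best_rank, key=best_rank.get)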
# Create new loader in preprocess.py
from typing import List

from langchain.document_loaders.base import BaseLoader
from langchain.schema import Document

class CustomDocumentLoader(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> List[Document]:
        # Implement custom loading logic here
        pass
# Add to rag_pipeline.py
def create_custom_retriever(vector_store, strategy_config: Dict):
"""Create custom retrieval strategy"""
if strategy_config["type"] == "custom":
return CustomRetriever(
vector_store=vector_store,
**strategy_config["params"]
)
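The CustomRetriever referenced here is not shown in the source. A minimal sketch of what it could look like (LangChain retrievers subclass BaseRetriever; the import path and hook signature vary by version):

from typing import Any, List
from langchain.schema import BaseRetriever, Document

class CustomRetriever(BaseRetriever):
    vector_store: Any   # the FAISS store to delegate to
    k: int = 5          # hypothetical parameter from strategy_config["params"]

    def _get_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
        # Delegate to plain similarity search; insert custom filtering
        # or re-ranking logic here.
        return self.vector_store.similarity_search(query, k=self.k)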
# Add to app.py
def display_custom_component():
"""Custom Streamlit component"""
with st.container():
col1, col2 = st.columns(2)
with col1:
# Custom visualization
pass
with col2:
# Interactive controls
pass
# Streamlit caching
@st.cache_data
def load_and_process_documents():
"""Cache document processing results"""
return process_documents()
@st.cache_resource
def initialize_models():
"""Cache model initialization"""
return load_embedding_model()
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List
async def process_documents_async(documents: List[str]):
"""Async document processing"""
loop = asyncio.get_event_loop()
with ThreadPoolExecutor() as executor:
tasks = [
loop.run_in_executor(executor, process_single_doc, doc)
for doc in documents
]
return await asyncio.gather(*tasks)
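A usage sketch (process_single_doc is a stand-in for your per-document worker, not a function from the source):

results = asyncio.run(process_documents_async(["doc1.txt", "doc2.txt"]))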
import logging
logger = logging.getLogger(__name__)

def robust_processing_pipeline(documents):
    """Processing with fallback mechanisms"""
    try:
        # Try advanced LangChain processing
        return langchain_processing(documents)
    except ImportError:
        # Fall back to basic processing
        return basic_processing(documents)
    except Exception as e:
        # Log the error and use minimal processing
        logger.error(f"Processing error: {e}")
        return minimal_processing(documents)
def handle_api_errors(func):
    """Decorator for API error handling"""
    # ConnectionError is a built-in; AuthenticationError and RateLimitError
    # are assumed to come from the API client library in use and must be
    # imported from it.
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ConnectionError:
            return "Connection error. Please check your internet connection."
        except AuthenticationError:
            return "API key invalid. Please check your configuration."
        except RateLimitError:
            return "Rate limit exceeded. Please wait and try again."
    return wrapper
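Applied to a hypothetical wrapper around the LLM call:

@handle_api_errors
def ask_model(prompt: str) -> str:
    return llm.generate_response(prompt=prompt, max_tokens=500, temperature=0.7)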
# Production environment
python -m venv prod-env
source prod-env/bin/activate
pip install -r requirements.txt
# Configure production settings
cp .env.example .env
# Edit .env with production values
# Run application
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Create data directory
RUN mkdir -p data/pdfs data/texts
# Expose port
EXPOSE 8501
# Health check
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
# Run application
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
version: '3.8'
services:
rag-pdf-chat:
build: .
ports:
- "8501:8501"
environment:
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
volumes:
- ./data:/app/data
restart: unless-stopped
# Build image
docker build -t pdf-knowledge-extractor .
# Run container
docker run -p 8501:8501 -e OPENROUTER_API_KEY=your_key pdf-knowledge-extractor
# Using docker-compose
docker-compose up -d
- **Connect Repository**
  - Link GitHub repository to Streamlit Cloud
  - Configure automatic deployments
- **Environment Variables**
  # Add in the Streamlit Cloud dashboard
  OPENROUTER_API_KEY = "your_key_here"
  STREAMLIT_SERVER_MAX_UPLOAD_SIZE = 200
- **Requirements**
  # Ensure requirements.txt includes all dependencies
  streamlit>=1.28.0
  langchain>=0.1.0
  # ... other dependencies
# Launch EC2 instance (Ubuntu 20.04 LTS)
# Connect via SSH
# Update system
sudo apt update && sudo apt upgrade -y
# Install Python and dependencies
sudo apt install python3-pip python3-venv git -y
# Clone repository
git clone https://github.com/saktisriraj/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor
# Setup application
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Configure environment
cp .env.example .env
nano .env # Add your API key
# Install and configure nginx
sudo apt install nginx -y
# /etc/nginx/sites-available/rag-pdf-chat
server {
listen 80;
server_name your-domain.com;
location / {
proxy_pass http://127.0.0.1:8501;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
# /etc/systemd/system/rag-pdf-chat.service
[Unit]
Description=Enhanced RAG PDF Chat
After=network.target
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/pdf-knowledge-extractor
Environment=PATH=/home/ubuntu/pdf-knowledge-extractor/venv/bin
ExecStart=/home/ubuntu/pdf-knowledge-extractor/venv/bin/streamlit run app.py --server.port 8501
Restart=always
[Install]
WantedBy=multi-user.target
# Enable and start service
sudo systemctl enable rag-pdf-chat
sudo systemctl start rag-pdf-chat
# Check status
sudo systemctl status rag-pdf-chat
# View logs
sudo journalctl -u rag-pdf-chat -f
web: streamlit run app.py --server.port=$PORT --server.address=0.0.0.0
# Install Heroku CLI and login
heroku login
# Create app
heroku create your-app-name
# Set environment variables
heroku config:set OPENROUTER_API_KEY=your_key
# Deploy
git push heroku main
- **Environment Variables**
  # Never commit .env files
  # Use secure environment variable management
  export OPENROUTER_API_KEY="secure_key_here"
- **HTTPS Configuration**
  # Use SSL certificates
  sudo certbot --nginx -d your-domain.com
- **Access Control**
  # Add authentication if needed
  import streamlit_authenticator as stauth
- **Resource Limits**
  # config.py - Production settings
  MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50MB
  MAX_CONCURRENT_USERS = 10
  CACHE_TTL = 3600  # 1 hour
- **Monitoring**
  # Add monitoring and logging
  import logging
  logging.basicConfig(
      level=logging.INFO,
      format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
  )
- **Horizontal Scaling**
  - Use load balancers for multiple instances
  - Implement session state management
  - Consider Redis for shared caching
- **Vertical Scaling**
  - Optimize memory usage
  - Use GPU instances for large models
  - Implement efficient caching strategies
# Solution 1: Upgrade pip
python -m pip install --upgrade pip
# Solution 2: Use specific index
pip install -r requirements.txt -i https://pypi.org/simple/
# Solution 3: Install specific problematic packages
pip install torch --index-url https://download.pytorch.org/whl/cpu
# Check LangChain version
pip show langchain
# Reinstall with specific version
pip uninstall langchain langchain-community
pip install langchain==0.1.0 langchain-community==0.0.10
# Windows specific
pip install faiss-cpu --no-cache-dir
# macOS specific
conda install -c conda-forge faiss-cpu
# Linux specific
pip install faiss-cpu==1.7.4
# Check environment loading
import os
from dotenv import load_dotenv
load_dotenv()
print(os.getenv("OPENROUTER_API_KEY"))
# Verify .env file format
OPENROUTER_API_KEY=your_key_here # No quotes, no spaces
# Install specific version
pip install sentence-transformers==2.2.2
# Clear cache and reinstall
pip cache purge
pip install sentence-transformers --no-cache-dir
# Check data directory structure
import os
print(os.listdir("data/"))
# Reset and rebuild index
python -c "
from indexer import get_index_info
print(get_index_info())
"
# Reduce chunk size for faster processing
# In .env file:
LANGCHAIN_CHUNK_SIZE=400
CHUNK_SIZE=300
TOP_K=3
# Or modify config.py temporarily:
CHUNK_SIZE = 300
CHUNK_OVERLAP = 30
# Monitor memory usage
import psutil
print(f"Memory usage: {psutil.virtual_memory().percent}%")
# Optimize settings in config.py:
DEVICE = "cpu" # Ensure CPU-only processing
LANGCHAIN_ENABLE_CACHING = False # Disable caching if needed
# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s "https://openrouter.ai/api/v1/models"
# Reduce context size
TOP_K=3 # Fewer retrieved chunks
LANGCHAIN_MAX_TOKENS=1000 # Shorter responses
# Check port availability
netstat -an | grep 8501
# Try different port
streamlit run app.py --server.port 8502
# Clear Streamlit cache
streamlit cache clear
# Check file size limits
# In .streamlit/config.toml:
[server]
maxUploadSize = 200
# Verify file permissions
ls -la data/pdfs/
chmod 755 data/
# Check session state
import streamlit as st
print(st.session_state)
# Clear and reinitialize (copy the keys first; deleting while
# iterating over st.session_state directly raises a RuntimeError)
if st.button("Reset Session"):
    for key in list(st.session_state.keys()):
        del st.session_state[key]
    st.rerun()
# Test API connection
from model_loader import test_openrouter_connection
success, message = test_openrouter_connection()
print(f"API Status: {success}, Message: {message}")
# Check API key validity
import os
import requests
headers = {"Authorization": f"Bearer {os.getenv('OPENROUTER_API_KEY')}"}
response = requests.get("https://openrouter.ai/api/v1/models", headers=headers)
print(f"Status: {response.status_code}")
# Implement retry logic
import time
from functools import wraps
def retry_api_call(max_retries=3, delay=1):
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(max_retries):
try:
return func(*args, **kwargs)
except Exception as e:
if "rate limit" in str(e).lower() and attempt < max_retries - 1:
time.sleep(delay * (2 ** attempt)) # Exponential backoff
continue
raise e
return None
return wrapper
return decorator
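A usage sketch (generate_answer is a hypothetical wrapper around the LLM call):

@retry_api_call(max_retries=3, delay=1)
def generate_answer(prompt: str) -> str:
    return llm.generate_response(prompt=prompt)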
# Add to config.py
DEBUG = True
LANGCHAIN_VERBOSE = True
LANGCHAIN_DEBUG = True
# Enable detailed logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Install memory profiler
pip install memory-profiler
# Profile specific functions
from memory_profiler import profile
@profile
def process_large_document():
# Your processing code
pass
# Add timing decorators
import time
from functools import wraps
def timer(func):
@wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
end = time.time()
print(f"{func.__name__} took {end - start:.2f} seconds")
return result
return wrapper
@timer
def slow_function():
# Function to profile
pass
# Test connectivity
ping openrouter.ai
nslookup openrouter.ai
# Check DNS resolution
dig openrouter.ai
# Test HTTPS connectivity
curl -I https://openrouter.ai/api/v1/models
# Implement graceful fallbacks
def robust_rag_pipeline(query, fallback_enabled=True):
try:
return advanced_rag_pipeline(query)
except Exception as e:
if fallback_enabled:
print(f"Advanced pipeline failed: {e}")
return simple_rag_pipeline(query)
raise e
def simple_rag_pipeline(query):
    """Simplified RAG without advanced features"""
    # Sketch of a basic similarity search without LangChain; assumes a
    # module-level embedding_model, faiss_index, chunk list, and the
    # generate_response helper are available.
    query_vec = embedding_model.encode([query]).astype("float32")
    _, ids = faiss_index.search(query_vec, 3)
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    return generate_response(f"Context: {context}\nQuestion: {query}\nAnswer:")
# Backup and restore functions
import os
import shutil
from datetime import datetime

def backup_index():
    """Backup FAISS index and metadata"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = f"backups/backup_{timestamp}"
    os.makedirs(backup_dir, exist_ok=True)
    shutil.copytree("data/", f"{backup_dir}/data/")
    print(f"Backup created: {backup_dir}")

def restore_index(backup_path):
    """Restore from backup"""
    if os.path.exists(backup_path):
        shutil.copytree(f"{backup_path}/data/", "data/", dirs_exist_ok=True)
        print("Restore completed")
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/YOUR_USERNAME/pdf-knowledge-extractor.git
cd pdf-knowledge-extractor
# Add the original repository as upstream
git remote add upstream https://github.com/saktisriraj/pdf-knowledge-extractor.git
# Create development environment
python -m venv dev-env
source dev-env/bin/activate # Windows: dev-env\Scripts\activate
# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt # Development tools
Create `requirements-dev.txt`:
# Testing
pytest>=7.0.0
pytest-cov>=4.0.0
pytest-mock>=3.10.0
# Code Quality
black>=23.0.0
isort>=5.12.0
flake8>=6.0.0
mypy>=1.0.0
# Documentation
sphinx>=6.0.0
sphinx-rtd-theme>=1.2.0
# Development Tools
pre-commit>=3.0.0
jupyter>=1.0.0
ipython>=8.0.0
# Use Black formatter with 88 character line length
black . --line-length 88
# Import sorting with isort
isort . --profile black
# Type hints for all functions
def process_document(file_path: str, chunk_size: int = 500) -> List[str]:
"""Process document and return chunks.
Args:
file_path: Path to the document file
chunk_size: Size of text chunks
Returns:
List of text chunks
Raises:
FileNotFoundError: If file doesn't exist
ValueError: If chunk_size is invalid
"""
pass
class DocumentProcessor:
"""Advanced document processing with LangChain integration.
This class handles document loading, text extraction, and chunking
using various LangChain strategies for optimal performance.
Attributes:
strategy: Text splitting strategy ('recursive', 'semantic', etc.)
chunk_size: Maximum size of text chunks
Example:
>>> processor = DocumentProcessor(strategy='hybrid')
>>> chunks = processor.process_pdf('document.pdf')
>>> print(f"Created {len(chunks)} chunks")
"""
def __init__(self, strategy: str = 'hybrid', chunk_size: int = 500):
"""Initialize processor with specified strategy."""
pass
# Format: <type>(<scope>): <description>
# Types: feat, fix, docs, style, refactor, test, chore
# Examples:
git commit -m "feat(rag): add multi-query retrieval strategy"
git commit -m "fix(ui): resolve file upload progress bar issue"
git commit -m "docs(api): update RAG pipeline documentation"
git commit -m "refactor(indexer): optimize FAISS index creation"
Use this template for bug reports:
## Bug Report
**Describe the bug**
A clear description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
**Expected behavior**
What you expected to happen.
**Screenshots**
If applicable, add screenshots.
**Environment:**
- OS: [e.g. Windows 10, macOS 12.0, Ubuntu 20.04]
- Python Version: [e.g. 3.9.7]
- Dependencies: [paste requirements.txt versions]
**Additional context**
Any other context about the problem.
## Feature Request
**Is your feature request related to a problem?**
A clear description of what the problem is.
**Describe the solution you'd like**
A clear description of what you want to happen.
**Describe alternatives you've considered**
Alternative solutions or features you've considered.
**Additional context**
Add any other context, mockups, or examples.
**Implementation ideas**
If you have ideas about how to implement this feature.
## Pull Request
**Description**
Brief description of changes made.
**Type of change**
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
**Testing**
- [ ] Tests pass locally
- [ ] Added tests for new functionality
- [ ] Manual testing completed
**Checklist**
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Code is commented appropriately
- [ ] Documentation updated
- [ ] No new warnings introduced
- **Branch Naming**
  # Feature branches
  git checkout -b feature/add-multi-modal-support
  # Bug fix branches
  git checkout -b fix/resolve-memory-leak
  # Documentation branches
  git checkout -b docs/update-api-reference
- **Development Workflow**
  # Update main branch
  git checkout main
  git pull upstream main
  # Create feature branch
  git checkout -b feature/your-feature-name
  # Make changes and commit
  git add .
  git commit -m "feat(component): add new functionality"
  # Push to your fork
  git push origin feature/your-feature-name
  # Create pull request on GitHub
- **Code Review Process**
  - All PRs require at least one review
  - Address reviewer comments promptly
  - Update documentation for new features
  - Ensure tests pass before requesting review
# tests/test_rag_pipeline.py
import pytest
from unittest.mock import Mock, patch
from rag_pipeline import EnhancedLangChainRAG
class TestEnhancedLangChainRAG:
def setup_method(self):
"""Setup test fixtures."""
self.rag = EnhancedLangChainRAG()
def test_analyze_query_intent_summary(self):
"""Test query intent analysis for summary queries."""
query = "Summarize the main points of this document"
intent = self.rag.analyze_query_intent(query)
assert intent == "summary"
@patch('rag_pipeline.generate_response')
def test_run_rag_pipeline_success(self, mock_generate):
"""Test successful RAG pipeline execution."""
mock_generate.return_value = "Test response"
# Mock dependencies
mock_index = Mock()
chunks = ["chunk1", "chunk2"]
metadata = [{"source": "doc1"}, {"source": "doc2"}]
response, chunk_info = self.rag.run_rag_pipeline(
"test query", mock_index, chunks, metadata
)
assert response == "Test response"
assert len(chunk_info) > 0
# tests/test_integration.py
import tempfile
import os
from app import main
from indexer import build_faiss_index
from preprocess import load_and_split_texts
class TestIntegration:
def test_end_to_end_pipeline(self):
"""Test complete document processing pipeline."""
with tempfile.TemporaryDirectory() as temp_dir:
# Create test PDF
test_pdf_path = os.path.join(temp_dir, "test.pdf")
self.create_test_pdf(test_pdf_path)
# Process document
chunks, metadata = load_and_split_texts()
assert len(chunks) > 0
# Build index
index, processed_chunks = build_faiss_index(chunks, metadata)
assert index is not None
# Test query
from rag_pipeline import run_rag_pipeline
response, sources = run_rag_pipeline(
"What is this document about?",
index, chunks, metadata
)
assert len(response) > 0
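The create_test_pdf helper used above is not shown in the source; a sketch using PyMuPDF (fitz), which the project already depends on:

import fitz  # PyMuPDF

def create_test_pdf(self, path: str) -> None:
    """Write a one-page PDF with known text for the integration test."""
    doc = fitz.open()  # new, empty PDF
    page = doc.new_page()
    page.insert_text((72, 72), "This is a test document about RAG pipelines.")
    doc.save(path)
    doc.close()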
# Run all tests
python -m pytest tests/ -v
# Run specific test file
python -m pytest tests/test_rag_pipeline.py -v
# Run with coverage
python -m pytest tests/ --cov=. --cov-report=html
# Run integration tests only
python -m pytest tests/test_integration.py -v
# Use Google-style docstrings
def retrieve_chunks_advanced(
query: str,
index: Any,
chunks: List[str],
metadata: Optional[List[Dict]] = None,
top_k: int = 5
) -> Tuple[List[str], List[Dict]]:
"""Retrieve document chunks using advanced strategies.
Performs similarity search with multiple retrieval strategies
including semantic search, MMR, and hybrid approaches.
Args:
query: User query string for semantic search
index: FAISS index for similarity computation
chunks: List of document text chunks
metadata: Optional metadata for each chunk
top_k: Number of top chunks to retrieve
Returns:
Tuple containing:
- List of retrieved text chunks
- List of chunk metadata with relevance scores
Raises:
ValueError: If query is empty or index is invalid
IndexError: If top_k exceeds available chunks
Example:
>>> chunks, info = retrieve_chunks_advanced(
... "What is machine learning?",
... faiss_index,
... document_chunks,
... chunk_metadata,
... top_k=5
... )
>>> print(f"Retrieved {len(chunks)} relevant chunks")
"""
When adding features, update the README:
- Feature list - Add new capabilities
- Installation - Update if new dependencies
- Usage examples - Show new functionality
- Configuration - Document new settings
Maintain these wiki pages:
- API Reference - Complete function documentation
- Architecture Guide - System design and components
- Deployment Guide - Production deployment instructions
- Troubleshooting - Common issues and solutions
- Contributing Guide - This document
- Changelog - Version history and breaking changes
- Enhanced RAG Pipeline: Complete LangChain integration with advanced retrieval strategies
- Multi-Strategy Text Processing: Recursive, token-based, semantic, and hybrid chunking
- FAISS Vector Search: Optimized similarity search with contextual compression
- Intent-Based Query Analysis: Automatic query classification for optimized responses
- Real-Time UI: Animated Streamlit interface with progress tracking
- Document Processing: Advanced PDF loading with PyMuPDF and LangChain loaders
- Vector Indexing: High-performance FAISS indexing with sentence-transformers
- API Integration: OpenRouter API management with DeepSeek R1 model
- Configuration Management: Comprehensive environment variable system
- Error Handling: Graceful fallbacks and comprehensive error recovery
- Gradient Styling: Modern CSS with animated elements
- Progress Tracking: Real-time processing feedback with step indicators
- Source Attribution: Expandable source references with relevance scoring
- Conversation Memory: Context-aware chat with history management
- Performance Optimization: Parallel processing and intelligent caching
- Cross-Platform: Windows, macOS, and Linux compatibility
- Modular Design: Clean separation of concerns for easy extension
- Production Ready: Comprehensive error handling and monitoring
- Environment variable management with `.env` support
- Configurable chunk sizes and overlap parameters
- Adjustable retrieval strategies and model parameters
- Customizable UI themes and animations
- Complete API reference with examples
- Architecture diagrams and component documentation
- Deployment guides for local and cloud environments
- Comprehensive troubleshooting and FAQ sections
- Unit tests for core components
- Integration tests for end-to-end workflows
- Performance benchmarks and optimization guidelines
- Error handling validation
- Feature description with technical details
- User-facing improvements and capabilities
- Issue resolution with root cause analysis
- Performance improvements and optimizations
- Architecture updates and refactoring
- Dependency updates and compatibility improvements
- New documentation sections
- Updated examples and tutorials
- API changes that require user action
- Configuration changes and migration guides
- Features marked for removal in future versions
- Migration paths for deprecated functionality
- Multi-Modal Support: Image and table extraction from PDFs
- Advanced Analytics: Document insights and visualization
- Batch Processing: Multiple document upload optimization
- Export Features: Conversation and insight export functionality
- API Endpoints: RESTful API for programmatic access
- Authentication: User management and access control
- Cloud Storage: Integration with AWS S3, Google Drive
- Advanced Search: Full-text search and filtering capabilities
- Multi-Language Support: International language processing
- Custom Models: Support for local LLM deployment
- Enterprise Features: Advanced security and compliance
- Mobile App: Native mobile application development
This comprehensive documentation provides everything needed to understand, use, develop, and deploy the Enhanced RAG PDF Chat application. Each section is designed to be self-contained while linking to related concepts throughout the documentation.
For the latest updates and additional resources, visit the [GitHub repository](https://github.com/saktisriraj/pdf-knowledge-extractor) and [project wiki](https://github.com/saktisriraj/pdf-knowledge-extractor/wiki).