Technical Architecture
System Architecture Overview
The Web Content Analyzer employs a sophisticated multi-layered architecture built around a Retrieval-Augmented Generation (RAG) pipeline with ChromaDB vector storage, OpenAI embeddings, and intelligent content processing. The system is designed for production deployment with emphasis on performance, reliability, and semantic accuracy.
Core Technology Stack & Design Decisions
Web Crawling Engine
- Primary Tools: Beautiful Soup 4 + Trafilatura + Requests
- Architecture Decision: Trafilatura was chosen over alternatives like newspaper3k or raw BeautifulSoup because it provides superior content extraction quality by specifically targeting main content areas while filtering out navigation, advertisements, and boilerplate text. This results in 40-60% cleaner content compared to generic HTML parsing.
- Crawling Strategy (a minimal loop sketch follows this list):
  - Implements breadth-first search with intelligent queue management
  - Domain scoping prevents external link following, using urlparse() validation
  - Rate limiting with configurable delays (0.5-5 seconds) respects server resources
  - Robots.txt compliance through header inspection
  - Maximum page limits (1-100) provide resource control
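The strategy above reduces to a short breadth-first crawl loop. The sketch below is illustrative rather than the project's actual crawler: the `crawl()` name, the `max_pages`/`delay` parameters, and the direct use of `requests` are assumptions, while `_extract_links()` refers to the link-discovery helper shown immediately below.

```python
import time
from collections import deque

import requests

def crawl(self, start_url: str, max_pages: int = 25, delay: float = 1.0):
    queue = deque([start_url])            # BFS frontier with FIFO ordering
    visited, pages = set(), []

    while queue and len(pages) < max_pages:   # Maximum page limit
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url, timeout=10)
        pages.append({'url': url, 'html': response.text})

        # Domain scoping happens inside _extract_links() via urlparse() validation
        for link in self._extract_links(response.text, url):
            if link not in visited:
                queue.append(link)

        time.sleep(delay)                 # Configurable rate limiting
    return pages
```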
- Link Discovery Algorithm:
```python
# Enhanced link extraction targeting 30+ links per page
from typing import Set
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def _extract_links(self, html: str, base_url: str) -> Set[str]:
    soup = BeautifulSoup(html, 'html.parser')
    base_domain = urlparse(base_url).netloc   # scope crawling to the start domain
    links = set()
    for tag in soup.find_all(['a', 'link'], href=True):
        href = urljoin(base_url, tag['href'])
        if self._is_valid_url(href, base_domain):
            links.add(href.split('#')[0])  # Remove fragments
    return links
```
Vector Database & Embedding Architecture
ChromaDB Selection Rationale
ChromaDB was chosen over alternatives like Pinecone, Weaviate, or Faiss because:
- Local deployment eliminates external API dependencies
- Persistent storage with SQLite backend ensures data durability
- Native metadata support enables rich filtering and attribution
- Open-source licensing provides deployment flexibility
- Python-native integration reduces complexity
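As a rough illustration of how these properties are used, the sketch below creates a persistent local collection and adds one chunk with its metadata and a precomputed embedding. The storage path and collection name are assumptions, not values taken from the project.

```python
import chromadb

# Local, persistent store backed by SQLite (path is an assumed example)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="website_content",                   # assumed collection name
    metadata={"hnsw:space": "cosine"},         # cosine distance for retrieval
)

# Native metadata support: each chunk carries its source URL and title
collection.add(
    ids=["chunk-0"],
    documents=["Example chunk text..."],
    embeddings=[[0.12, -0.56, 0.33]],          # 1536-dim in practice
    metadatas=[{"url": "https://example.com/page1", "title": "Page Title"}],
)
```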
Embedding Model: OpenAI's text-embedding-ada-002
Justification: Despite newer models being available, ada-002 provides an optimal balance of:
- 1536-dimensional vectors offering rich semantic representation
- Cost efficiency at $0.0001 per 1K tokens
- Proven performance across diverse content types
- Stable API ensuring consistent embeddings over time
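A minimal embedding call with the current `openai` Python client might look like the following; the helper name and batching approach are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # One batched request; each returned vector has 1536 dimensions
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=chunks,
    )
    return [item.embedding for item in response.data]
```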
Intelligent Text Chunking Strategy
Chunking Parameters
- Chunk Size: 600 tokens (down from an initial 1,000)
- Overlap: 100 tokens (16.7% overlap ratio)
- Encoding: tiktoken with cl100k_base (GPT-4 tokenizer)
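Token counts against these limits are measured with tiktoken; a counting helper consistent with the `_count_tokens` calls in the chunking code below could be as simple as:

```python
import tiktoken

# cl100k_base matches GPT-4 and text-embedding-ada-002 tokenization
_ENCODER = tiktoken.get_encoding("cl100k_base")

def _count_tokens(text: str) -> int:
    return len(_ENCODER.encode(text))
```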
Advanced Chunking Algorithm:
```python
import re

def _smart_chunk_text(self, text: str, max_tokens: int = 600, overlap_tokens: int = 100):
    # 1. Primary split on double newlines (paragraphs)
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks = []
    # 2. Semantic boundary preservation
    for paragraph in paragraphs:
        if self._count_tokens(paragraph) <= max_tokens:
            chunks.append(paragraph)  # Paragraph fits - preserve intact
        else:
            # Split on sentences using regex, regroup under the token budget
            sentences = [s.strip() for s in re.split(r'[.!?]+', paragraph) if s.strip()]
            current = ''
            for sentence in sentences:
                candidate = f"{current} {sentence}".strip()
                if self._count_tokens(candidate) <= max_tokens:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = sentence
            if current:
                chunks.append(current)
    return chunks  # overlap of `overlap_tokens` between adjacent chunks is applied in a later pass (omitted here)
```
Technical Justification:
- 600 tokens is optimal for semantic coherence while still carrying enough surrounding context
- Paragraph-first splitting preserves logical content boundaries
- Sentence-level fallback prevents mid-sentence cuts
- Overlap ensures context continuity across chunks
- Token counting with tiktoken ensures accurate OpenAI API compatibility
Embedding Caching Architecture
Implementation Strategy: Dual-layer caching system
- Runtime Cache: In-memory storage during session
- Persistent Cache: JSON files with embedded vectors
- Cache File Structure:
```json
{
  "domain": "example.com",
  "timestamp": "2025-06-19T15:30:00Z",
  "total_pages": 25,
  "content": [
    {
      "url": "https://example.com/page1",
      "title": "Page Title",
      "content": "Full text content...",
      "word_count": 1247,
      "embeddings": [0.1234, -0.5678, ...]  // 1536-dimensional vector
    }
  ]
}
```
Performance Impact:
- Eliminates redundant OpenAI API calls, avoiding repeat embedding costs on every reload
- Enables instant analysis restart without re-crawling
- Reduces session initialization time from 30-60 seconds to <5 seconds
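A sketch of the reuse path, assuming cache files follow the naming convention described under Data Persistence & Cache Management and live in a local `cache/` directory (the directory name and helper are assumptions):

```python
import json
from pathlib import Path

def load_latest_cache(domain: str, cache_dir: str = "cache"):
    # Most recently written cache file for this domain, if one exists
    candidates = sorted(
        Path(cache_dir).glob(f"{domain}_*pages.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if not candidates:
        return None  # caller falls back to a fresh crawl and embedding pass
    with candidates[0].open(encoding="utf-8") as f:
        return json.load(f)  # pages with embedded vectors, ready to re-index
```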
Large Language Model Integration
Primary LLM: OpenAI GPT-4o
Selection Rationale:
GPT-4o was chosen over GPT-4-turbo or Claude because:
- Superior instruction following for structured outputs
- Enhanced context window (128K tokens) handles large content sets
- Optimized for reasoning tasks with higher accuracy
- Multimodal capabilities for future image analysis features
- Cost-effective at $2.50/1M input tokens
Prompt Engineering Architecture:
System Prompt Foundation:
system_prompt = """You are a helpful assistant that analyzes website content and provides accurate, well-sourced answers based solely on the provided information. Always cite which sources you're drawing from."""
Dynamic Verbosity Prompts:
- Concise (max_tokens=400): "provide a brief, focused answer"
- Balanced (max_tokens=800): "provide a balanced, informative answer"
- Comprehensive (max_tokens=1200): "provide a detailed, thorough analysis with specific examples"
Context Assembly Strategy:
```python
def analyze_content(self, question: str, verbosity: str = 'concise'):
    # Retrieve the k most relevant chunks, each with its text and source metadata
    relevant_chunks = self._semantic_search(question, k=5)
    context_parts = []
    for item in relevant_chunks:
        chunk, metadata = item['text'], item['metadata']
        context_parts.append(f"Source: {metadata['title']}\nContent: {chunk}")
    context = "\n\n---\n\n".join(context_parts)
```
Semantic Search Implementation
Vector Similarity Algorithm:
ChromaDB uses cosine similarity with L2 normalization
Search Pipeline:
- Query Embedding: Input question → OpenAI ada-002 → 1536-dim vector
- Similarity Search: ChromaDB.query() with cosine distance
- Result Ranking: Distance conversion to similarity score
- Metadata Enrichment: Source attribution and relevance scoring
Similarity Score Calculation:
```python
similarity = max(0, 1 - distance)         # Convert ChromaDB distance to similarity
confidence = min(avg_relevance * 2, 1.0)  # avg_relevance: mean similarity across retrieved chunks, scaled to 0-1
```
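Tying the pipeline together, a `_semantic_search` sketch over the ChromaDB collection. The result-dictionary shape (`documents`, `metadatas`, `distances`) follows ChromaDB's query API; the `self.collection` attribute and the reuse of an embedding helper like the `embed_chunks` sketch above are assumptions.

```python
def _semantic_search(self, question: str, k: int = 5):
    # 1. Query embedding: question -> ada-002 -> 1536-dim vector
    query_vector = embed_chunks([question])[0]

    # 2. Similarity search with cosine distance
    results = self.collection.query(
        query_embeddings=[query_vector],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )

    # 3. + 4. Rank by distance and attach source metadata
    matches = []
    for chunk, metadata, distance in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        matches.append({
            "text": chunk,
            "metadata": metadata,                 # source attribution
            "similarity": max(0, 1 - distance),   # distance -> similarity
        })
    return matches
```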
Reliability & Confidence Scoring
Multi-Factor Confidence Algorithm:
- Vector Similarity: Cosine similarity between query and retrieved chunks
- Source Diversity: Number of different pages contributing to answer
- Content Overlap: Degree of information consistency across sources
- Query Specificity: Token overlap between question and content
Confidence Categorization:
- Very Reliable (0.8-1.0): High similarity + multiple sources
- Mostly Reliable (0.6-0.8): Good similarity + sufficient context
- Moderately Reliable (0.4-0.6): Decent similarity + limited sources
- Less Reliable (<0.4): Low similarity + sparse context
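These bands map directly onto a small labeling helper; the thresholds come from the list above and the function name is illustrative.

```python
def reliability_label(confidence: float) -> str:
    # Thresholds mirror the categorization above
    if confidence >= 0.8:
        return "Very Reliable"
    if confidence >= 0.6:
        return "Mostly Reliable"
    if confidence >= 0.4:
        return "Moderately Reliable"
    return "Less Reliable"
```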
Website Size Estimation
Multi-Source Estimation Strategy:
- Sitemap Analysis: Parse XML sitemaps for authoritative page counts
- Robots.txt Discovery: Extract sitemap URLs from robots.txt
- Nested Sitemap Recursion: Follow sitemap index files
- Heuristic Fallbacks: Link density analysis for estimation
Implementation:
```python
def estimate_total_pages(self, start_url: str) -> Dict[str, Any]:
    parsed = urlparse(start_url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    domain = parsed.netloc
    estimates = []

    # Primary: Sitemap analysis
    sitemap_count = self._analyze_sitemaps(base_url, domain)
    if sitemap_count > 0:
        estimates.append(('sitemap', sitemap_count))

    # Secondary: Link density heuristic on the start page
    initial_html = requests.get(start_url, timeout=10).text
    initial_links = len(self._extract_links(initial_html, start_url))
    heuristic_estimate = int(initial_links * 2.5)  # Empirical multiplier

    return {
        'estimated_pages': max(count for _, count in estimates) if estimates else heuristic_estimate,
        'confidence': 'high' if sitemap_count else 'medium',
        'sources': [source for source, _ in estimates]
    }
```
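A possible shape for `_analyze_sitemaps`, combining robots.txt discovery with nested sitemap recursion. The regex-based `<loc>` extraction and the recursion depth limit are simplifications; error handling is omitted.

```python
import re

import requests

def _analyze_sitemaps(self, base_url: str, domain: str, max_depth: int = 2) -> int:
    # Robots.txt discovery: collect declared sitemap URLs, else try the default location
    robots = requests.get(f"{base_url}/robots.txt", timeout=10).text
    sitemap_urls = re.findall(r"(?im)^sitemap:\s*(\S+)", robots) or [f"{base_url}/sitemap.xml"]

    def count_urls(sitemap_url: str, depth: int) -> int:
        xml = requests.get(sitemap_url, timeout=10).text
        locs = re.findall(r"<loc>\s*(.*?)\s*</loc>", xml)
        # Nested sitemap recursion: index files point at further sitemaps
        if "<sitemapindex" in xml and depth < max_depth:
            return sum(count_urls(loc, depth + 1) for loc in locs)
        return len(locs)

    return sum(count_urls(url, 0) for url in sitemap_urls)
```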
Network Graph Generation
Graph Construction Algorithm:
```python
from streamlit_agraph import Node, Edge

def create_network_graph(self, content):
    nodes = []
    edges = []

    # Node creation with content-based sizing
    for i, page in enumerate(content[:15]):  # Limit for visualization clarity
        title = page.get('title') or page.get('url', f'Page {i}')
        node_size = max(10, min(30, page.get('word_count', 100) / 50))
        nodes.append(Node(
            id=str(i),
            label=title[:20] + "..." if len(title) > 20 else title,
            size=node_size,
            color="#1f77b4"
        ))

    # Edge creation based on URL similarity
    for i, page1 in enumerate(content[:15]):
        for j, page2 in enumerate(content[:15]):
            if i != j:
                url1_parts = set(page1.get('url', '').split('/'))
                url2_parts = set(page2.get('url', '').split('/'))
                shared_parts = url1_parts.intersection(url2_parts)
                # Create edge if significant path overlap
                if len(shared_parts) > 2:
                    edges.append(Edge(source=str(i), target=str(j)))
    return nodes, edges
```
Visualization Technology: Streamlit-agraph with physics-based layout
- Node Sizing: Proportional to content word count
- Edge Creation: Based on URL path similarity heuristics
- Layout Algorithm: Force-directed with collision detection
- Color Coding: Semantic clustering by content type
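Rendering the nodes and edges built above uses streamlit-agraph's `Config`; the dimensions and physics settings shown here are illustrative defaults rather than the project's exact values.

```python
from streamlit_agraph import agraph, Config

config = Config(
    width=900,
    height=600,
    directed=False,
    physics=True,        # force-directed layout
    hierarchical=False,
)
agraph(nodes=nodes, edges=edges, config=config)
```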
Real-Time Progress Tracking
Multi-Metric Progress System:
```python
def progress_callback(visited, extracted, current_url=None, page_title=None):
    # url_queue, total_estimated, and self are closed over from the crawler scope
    progress_metrics = {
        'pages_discovered': len(url_queue),
        'pages_visited': visited,
        'content_extracted': extracted,
        'success_rate': extracted / visited if visited > 0 else 0,
        'current_page': current_url,
        'eta_seconds': self._calculate_eta(visited, total_estimated)
    }
    return progress_metrics
```
Performance Charts: Real-time Plotly visualizations showing:
- Cumulative pages crawled over time
- Content extraction success rate
- Crawling velocity (pages/minute)
- Queue depth and processing pipeline status
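One of these charts, sketched with Plotly and Streamlit. The metric names echo the progress callback above; the helper name and layout are assumptions.

```python
import plotly.graph_objects as go
import streamlit as st

def render_crawl_chart(timestamps, pages_visited, content_extracted):
    # Cumulative pages visited vs. successfully extracted over time
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=timestamps, y=pages_visited, name="Pages visited"))
    fig.add_trace(go.Scatter(x=timestamps, y=content_extracted, name="Content extracted"))
    fig.update_layout(xaxis_title="Time", yaxis_title="Pages")
    st.plotly_chart(fig, use_container_width=True)
```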
Data Persistence & Cache Management
File Naming Convention:
```python
def generate_cache_filename(self, domain: str, content: list) -> str:
    now = datetime.now()
    formatted_date = now.strftime("%b-%d-%Y").lower()             # e.g. jun-19-2025
    formatted_time = now.strftime("%I-%M%p").lower().lstrip("0")  # e.g. 3-45pm
    return f"{domain}_{formatted_date}_{formatted_time}_{len(content)}pages.json"
```
Example: verizon_jun-19-2025_3-45pm_25pages.json
Cache Optimization Strategies:
- Gzip compression for large content sets (40-60% size reduction)
- Incremental updates for partial re-crawls
- Metadata indexing for fast cache browsing
- Automatic cleanup of stale cache files (>30 days)
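For the compression point, cache files can be written and read through gzip with the standard library alone; the `.json.gz` suffix and helper names are assumptions.

```python
import gzip
import json

def save_cache_compressed(data: dict, path: str) -> None:
    # Text-mode gzip typically yields the 40-60% size reduction noted above
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f)

def load_cache_compressed(path: str) -> dict:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```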
Error Handling & Resilience
Fault Tolerance Design:
- Graceful degradation when ChromaDB unavailable
- Request retry logic with exponential backoff
- Content extraction fallbacks (Trafilatura → BeautifulSoup → raw text)
- OpenAI API rate limiting and error recovery
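The extraction fallback chain can be sketched as follows; `trafilatura.extract()` returns `None` when it cannot isolate a main content block, and the function name here is illustrative.

```python
import trafilatura
from bs4 import BeautifulSoup

def extract_content(html: str) -> str:
    # 1. Preferred: Trafilatura's main-content extraction
    text = trafilatura.extract(html)
    if text:
        return text

    # 2. Fallback: strip tags with BeautifulSoup, body first
    soup = BeautifulSoup(html, "html.parser")
    if soup.body:
        return soup.body.get_text(separator=" ", strip=True)

    # 3. Last resort: raw text of the whole document
    return soup.get_text(separator=" ", strip=True)
```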
Monitoring & Observability:
- Real-time error tracking in progress interface
- Confidence scoring alerts for low-quality responses
- Performance metrics logging for optimization
- Debug mode with detailed operation tracing
This architecture provides enterprise-grade reliability while maintaining the flexibility needed for diverse website analysis scenarios. The technical choices prioritize semantic accuracy, performance optimization, and user experience while ensuring scalable deployment across various environments.