question answering of codebases - chunhualiao/public-docs GitHub Wiki
Question: what are the best open-source solutions for answering questions about a large-scale C++ codebase, possibly using large language models and RAG techniques? What are the standardized benchmarks and leaderboards for this codebase question-answering task?
- QA of codebases: Gemini 1.5 Pro deep research
- QA of codebases: Perplexity deep research
- QA of codebases: GPT-4o
I'll create a diagram to explain how the question answering system works in DeepWiki's codebase.
```mermaid
graph TD
    subgraph "User Interface"
        A["User asks question about codebase"]
        B["Frontend sends query via WebSocket/HTTP"]
    end
    subgraph "API Layer"
        C["WebSocket Handler<br/>(/ws/chat)"]
        D["HTTP Stream Handler<br/>(/chat/completions/stream)"]
    end
    subgraph "RAG Processing Pipeline"
        E["Create RAG Instance"]
        F["Prepare Repository Retriever"]
        G["Query Vector Database<br/>(FAISS embeddings)"]
        H["Retrieve Relevant Code Snippets"]
        I["Group Documents by File Path"]
        J["Format Context with Headers"]
    end
    subgraph "Prompt Construction"
        K["Build Comprehensive Prompt"]
        L["Include Repository Context"]
        M["Add Query + Conversation History"]
        N["Apply Language Preferences"]
    end
    subgraph "AI Model Providers"
        O["Google Gemini"]
        P["OpenAI"]
        Q["OpenRouter"]
        R["Ollama Local"]
        S["Azure OpenAI"]
    end
    subgraph "Response Processing"
        T["Stream AI Response"]
        U["Handle Token Limits"]
        V["Apply Deep Research Logic"]
        W["Return Formatted Answer"]
    end
    subgraph "Storage Layer"
        X["Code Embeddings<br/>(~/.adalflow/databases/)"]
        Y["Repository Cache<br/>(~/.adalflow/repos/)"]
        Z["Conversation Memory"]
    end
    A --> B
    B --> C
    B --> D
    C --> E
    D --> E
    E --> F
    F --> G
    G --> X
    G --> H
    H --> I
    I --> J
    J --> K
    K --> L
    L --> M
    M --> N
    N --> O
    N --> P
    N --> Q
    N --> R
    N --> S
    O --> T
    P --> T
    Q --> T
    R --> T
    S --> T
    T --> U
    U --> V
    V --> W
    F --> Y
    K --> Z
    W --> B
    B --> A
    classDef userLayer fill:#e1f5fe
    classDef apiLayer fill:#f3e5f5
    classDef ragLayer fill:#e8f5e8
    classDef promptLayer fill:#fff3e0
    classDef aiLayer fill:#fce4ec
    classDef responseLayer fill:#f1f8e9
    classDef storageLayer fill:#f5f5f5
    class A,B userLayer
    class C,D apiLayer
    class E,F,G,H,I,J ragLayer
    class K,L,M,N promptLayer
    class O,P,Q,R,S aiLayer
    class T,U,V,W responseLayer
    class X,Y,Z storageLayer
```
**RAG Processing Pipeline:** The system creates code embeddings stored in ~/.adalflow/databases/ and uses FAISS for vector similarity search to find relevant code snippets.
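The retrieval step can be illustrated with a minimal sketch. This is not DeepWiki's actual code: the toy `embed` function stands in for a real embedding model, and plain Python cosine similarity stands in for FAISS's indexed nearest-neighbour search.

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy embedding: hash character bigrams into a fixed-size unit vector.
    # (A real pipeline uses a learned embedding model instead.)
    vec = [0.0] * dim
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Rank snippets by cosine similarity to the query embedding,
    # mimicking the vector-database lookup in the diagram above.
    q = embed(query)
    scored = []
    for path, snippet in corpus.items():
        s = embed(snippet)
        scored.append((sum(a * b for a, b in zip(q, s)), path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:k]]

corpus = {
    "src/parser.cpp": "Token Parser::next() { return lexer.scan(); }",
    "src/main.cpp": "int main() { run_server(); }",
    "docs/README.md": "Build with cmake and ninja.",
}
print(retrieve("How does the parser get tokens?", corpus))
```

The returned file paths feed the "Retrieve Relevant Code Snippets" stage; in production the index is persisted on disk rather than rebuilt per query.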
**Dual API Architecture:** Questions can be processed through either WebSocket connections (/ws/chat) or HTTP streaming (/chat/completions/stream) endpoints, both implementing the same core RAG logic.
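The key design point is that both endpoints wrap one shared answer generator. A hedged sketch (handler names and the stubbed token stream are hypothetical, not DeepWiki's API):

```python
import asyncio
from typing import AsyncIterator

async def answer_stream(query: str) -> AsyncIterator[str]:
    # Shared core: both transports delegate here and relay tokens
    # as they arrive from the model (stubbed with fixed strings).
    for token in ["Retrieved", " context,", " generated", " answer."]:
        await asyncio.sleep(0)  # stand-in for awaiting the model
        yield token

async def websocket_handler(query: str) -> str:
    # /ws/chat: would push each token over the socket; collected here.
    return "".join([tok async for tok in answer_stream(query)])

async def http_stream_handler(query: str) -> str:
    # /chat/completions/stream: same generator behind a streaming response.
    return "".join([tok async for tok in answer_stream(query)])

print(asyncio.run(websocket_handler("what does main() do?")))
# → Retrieved context, generated answer.
```

Because the RAG logic lives in one place, transport choice (WebSocket vs. chunked HTTP) never affects answer content.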
**Multi-Provider Support:** The system supports multiple AI providers including Google Gemini, OpenAI, OpenRouter, Azure OpenAI, and local Ollama models.
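Provider selection typically reduces to a registry keyed by name. This sketch trims the five providers to three stub functions; the real clients (e.g. the official OpenAI or Gemini SDKs) would be swapped in where the stubs are:

```python
from typing import Callable

# Stub clients standing in for real provider SDK calls.
def gemini(prompt: str) -> str: return f"[gemini] {prompt}"
def openai_chat(prompt: str) -> str: return f"[openai] {prompt}"
def ollama_local(prompt: str) -> str: return f"[ollama] {prompt}"

PROVIDERS: dict[str, Callable[[str], str]] = {
    "google": gemini,
    "openai": openai_chat,
    "ollama": ollama_local,
}

def complete(provider: str, prompt: str) -> str:
    # Dispatch to the configured provider; fail loudly on unknown names.
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return PROVIDERS[provider](prompt)

print(complete("ollama", "summarize src/main.cpp"))
# → [ollama] summarize src/main.cpp
```

A registry like this keeps the prompt-construction and streaming stages provider-agnostic.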
**Context-Aware Responses:** Retrieved code snippets are grouped by file path and formatted with clear headers before being sent to the AI model for context-aware documentation generation.
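The grouping-and-headers step is straightforward to sketch. The `## File:` header format below is an assumption for illustration, not necessarily the exact format DeepWiki emits:

```python
from collections import defaultdict

def format_context(snippets: list[tuple[str, str]]) -> str:
    # Group retrieved snippets by file path, then emit each group
    # under a header so the model can attribute code to its file.
    groups: dict[str, list[str]] = defaultdict(list)
    for path, code in snippets:
        groups[path].append(code)
    parts = []
    for path in sorted(groups):
        parts.append(f"## File: {path}")
        parts.extend(groups[path])
    return "\n".join(parts)

snippets = [
    ("src/lexer.cpp", "Token Lexer::scan() { ... }"),
    ("src/parser.cpp", "Token Parser::next() { return lexer.scan(); }"),
    ("src/lexer.cpp", "void Lexer::reset() { pos = 0; }"),
]
print(format_context(snippets))
```

Grouping matters because a retriever often returns several chunks from the same file out of order; collating them gives the model a coherent per-file view.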
**Advanced Features:** The system includes Deep Research capabilities for complex queries and file-specific query handling when users specify particular file paths.
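File-specific handling hinges on detecting a path in the user's question and narrowing retrieval to it. A minimal, hypothetical detector (the regex and extension list are assumptions; DeepWiki's actual detection may differ):

```python
import re
from typing import Optional

def extract_file_filter(query: str) -> Optional[str]:
    # If the user names a specific source file, return it so the
    # retriever can restrict its search to that file's chunks.
    m = re.search(r"[\w./-]+\.(?:cpp|cc|h|hpp|py|md)\b", query)
    return m.group(0) if m else None

print(extract_file_filter("explain the loop in src/main.cpp"))
# → src/main.cpp
```

When no path is found, retrieval falls back to searching the whole repository index.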
This diagram illustrates the complete flow from user question to AI-generated answer, showing how DeepWiki combines vector search with large language models to provide accurate, context-aware responses about codebases. The system maintains conversation history and supports multiple languages based on user preferences.