Repository Analysis - Chris-Cullins/wiki_bot GitHub Wiki
Overview
The Repository Analysis area provides the core infrastructure for discovering, traversing, and understanding the structure of code repositories. It serves as the foundation for all documentation generation by crawling file systems, respecting ignore patterns, and building hierarchical representations of project structure. This area also manages the integration with various LLM providers to power intelligent documentation generation.
Primary Responsibilities:
- Crawl and index repository file structures while respecting `.gitignore` patterns
- Build hierarchical tree representations of project files and directories
- Provide abstraction layer for multiple LLM providers (Agent SDK, Claude CLI, Codex CLI)
- Manage git repository operations including cloning, status checking, and synchronization
- Load and validate application configuration from environment variables
Key Components
Repository Crawling (src/repo-crawler.ts)
RepoCrawler class: Traverses file systems and builds structured representations of repositories. Handles ignore pattern filtering and path normalization across platforms.
- `crawl(repoPath: string)`: Entry point that initiates repository traversal
- `getFilePaths(tree: FileNode)`: Extracts a flat list of all file paths from the tree structure
- `FileNode` interface: Represents files and directories with hierarchical relationships
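As a hedged sketch of what those shapes might look like in practice (the actual interface in `src/repo-crawler.ts` may differ in field names):

```typescript
// Hypothetical shape of the tree nodes built by RepoCrawler.
interface FileNode {
  name: string;
  path: string;          // normalized, relative to the repo root
  type: 'file' | 'directory';
  children?: FileNode[]; // present for directories
}

// Flatten a tree into the list of file paths, as getFilePaths() is described to do.
function getFilePaths(tree: FileNode): string[] {
  if (tree.type === 'file') return [tree.path];
  return (tree.children ?? []).flatMap(getFilePaths);
}
```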
LLM Provider Integration (src/query-factory.ts)
createQueryFunction(): Factory that returns the appropriate query implementation based on configuration. Supports three provider modes:
- Agent SDK: Direct integration with Anthropic's Agent SDK
- Claude CLI: Executes the `claude` CLI with repository context
- Codex CLI: Executes `codex exec` for prompt processing
Each provider returns a standardized Query iterator interface for consistent consumption.
Git Repository Management (src/github/git-repository-manager.ts)
GitRepositoryManager class: Handles all git operations with support for multiple operational modes:
- Fresh mode: Always clone from scratch
- Incremental mode: Update existing or clone if missing
- Reuse-or-clone mode: Use existing repository without modification
Provides status checking, commit creation, and push operations with credential sanitization.
Configuration Management (src/config.ts)
loadConfig(): Centralizes all application settings loaded from environment variables. Validates required credentials and provides sensible defaults.
Config interface: Type-safe configuration object covering:
- API credentials and endpoints
- Repository paths and URLs
- Documentation generation settings
- Debugging and logging options
Application Entry Point (src/index.ts)
main(): Orchestrates the complete documentation generation workflow:
- Parse CLI arguments and load configuration
- Crawl repository structure
- Initialize wiki generator with appropriate LLM provider
- Generate documentation pages (Home, Architecture, Areas)
- Write results to wiki repository if configured
How It Works
Repository Crawling Flow
- Initialize ignore filter: Combines default patterns (`.git/`, `node_modules/`, etc.) with `.gitignore` contents
- Recursive traversal: Starting from the root, iterate through directories and files
- Filter evaluation: Each path is checked against ignore patterns before inclusion
- Path normalization: Convert platform-specific separators to forward slashes
- Tree construction: Build a hierarchical `FileNode` structure with parent-child relationships
```
// Example flow through crawlDirectory
dirPath → readdir() → for each entry:
  - Build full path
  - Calculate relative path from root
  - Normalize path separators
  - Check ignore filter
  - Recurse if directory, add if file
```
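The same flow can be sketched as a pure recursion; here a fake in-memory `readdir` stands in for the real filesystem so the traversal logic stands alone (ignore filtering omitted for brevity, and all names are illustrative rather than the actual implementation):

```typescript
interface Entry { name: string; isDirectory: boolean; }
interface Node { name: string; path: string; children?: Node[]; }

// readdir maps a directory path to its entries (stand-in for fs.readdirSync).
function crawlDirectory(
  dirPath: string,
  rootPath: string,
  readdir: (p: string) => Entry[],
): Node {
  // Root directory special case: its relative path is '.'
  const relative = dirPath === rootPath ? '.' : dirPath.slice(rootPath.length + 1);
  const children: Node[] = [];
  for (const entry of readdir(dirPath)) {
    const full = `${dirPath}/${entry.name}`;
    if (entry.isDirectory) {
      children.push(crawlDirectory(full, rootPath, readdir));
    } else {
      // Files carry their path relative to the root, with forward slashes.
      children.push({ name: entry.name, path: full.slice(rootPath.length + 1) });
    }
  }
  return { name: relative, path: relative, children };
}
```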
LLM Provider Abstraction
The query factory pattern allows seamless switching between providers:
```
// All providers implement the same interface
QueryFunction: (params: { prompt: string }) => Query

// Query yields standardized messages
yield {
  type: 'assistant',
  content: string
}
```
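A minimal runnable sketch of that contract, with a mock provider standing in for the real SDK and CLI implementations (the type names follow the description above, not the actual source):

```typescript
// Standardized message and query shapes shared by all providers (illustrative).
interface QueryMessage {
  type: 'assistant';
  content: string;
}
type Query = AsyncIterable<QueryMessage>;
type QueryFunction = (params: { prompt: string }) => Query;

// Mock provider: real implementations call the Agent SDK or spawn a CLI process.
const mockProvider: QueryFunction = ({ prompt }) =>
  (async function* () {
    yield { type: 'assistant' as const, content: `echo: ${prompt}` };
  })();
```

Because every provider yields the same message shape, consumers can `for await` over a `Query` without knowing which backend produced it.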
The Claude CLI implementation adds repository context via the `--add-dir` flag and enforces a system prompt for a consistent output format.
The Codex CLI implementation parses JSON-formatted output, extracting `agent_message` entries from the streaming response.
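A hedged sketch of that parsing step: each stdout line is tried as JSON, and only `agent_message` entries contribute text (the exact field names in Codex's stream are assumptions here):

```typescript
// Extract assistant text from Codex-style JSON-lines output (illustrative shape).
function extractAgentMessages(stdout: string): string[] {
  const messages: string[] = [];
  for (const line of stdout.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      const event = JSON.parse(trimmed);
      if (event.type === 'agent_message' && typeof event.message === 'string') {
        messages.push(event.message);
      }
    } catch {
      // Non-JSON lines (progress noise, warnings) are ignored.
    }
  }
  return messages;
}
```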
Git Repository State Management
The repository manager enforces clean state transitions:
- Status checking: Detects uncommitted changes, branch position, and tracking state
- Safety guards: Prevents updates when uncommitted changes exist
- Credential handling: Injects tokens into URLs for authenticated operations
- Sanitization: Redacts credentials from error messages and logs
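The sanitization step can be sketched as a redaction pass over anything that might be logged (a minimal version assuming userinfo-in-URL credentials; the real implementation may cover more forms):

```typescript
// Redact userinfo (user:token@) embedded in URLs before logging.
function sanitizeCredentials(text: string): string {
  return text.replace(/(https?:\/\/)[^/@\s]+@/g, '$1***@');
}
```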
Important Functions/Classes
RepoCrawler.crawlDirectory(dirPath: string, rootPath: string)
Recursive core of repository traversal. Returns FileNode representing directory with all children.
Key logic:
- Maintains relative paths from `rootPath` for consistency
- Handles root directory special case (path = `.`)
- Sorts entries deterministically via `readdir` order
createClaudeCliQuery(repoPath: string)
Returns query function that executes Claude CLI with repository context. Critical for ensuring the LLM has full codebase visibility.
System prompt injection:
```typescript
CLAUDE_CLI_SYSTEM_PROMPT =
  'You are an expert technical writer generating polished GitHub wiki pages...'
```
This ensures consistent output format across all prompts.
GitRepositoryManager.prepare()
Entry point for repository initialization based on configured mode:
```
switch (this._mode) {
  case 'fresh':          clean() then clone()
  case 'incremental':    update() or clone()
  case 'reuse-or-clone': clone() only if missing
}
```
loadConfig()
Configuration loader with intelligent defaults:
- Test mode detection: Allows mock operations without API keys
- Provider selection: Maps environment variable to typed enum
- Incremental docs inference: Auto-enables when mode is `incremental` or `reuse-or-clone`
- Path resolution: Converts relative paths to absolute based on repo location
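The inference logic above can be sketched by taking the environment as a parameter, which makes the defaults easy to see (the variable names and config fields here are assumptions, not the actual ones in `src/config.ts`):

```typescript
type RepoMode = 'fresh' | 'incremental' | 'reuse-or-clone';

interface Config {
  provider: 'agent-sdk' | 'claude-cli' | 'codex-cli';
  repoMode: RepoMode;
  incrementalDocs: boolean;
}

// Hypothetical loader: the real config also covers credentials, paths, and more.
function loadConfig(env: Record<string, string | undefined>): Config {
  const provider = (env.LLM_PROVIDER ?? 'agent-sdk') as Config['provider'];
  const repoMode = (env.REPO_MODE ?? 'fresh') as RepoMode;
  return {
    provider,
    repoMode,
    // Incremental docs auto-enable for modes that keep an existing checkout.
    incrementalDocs: repoMode === 'incremental' || repoMode === 'reuse-or-clone',
  };
}
```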
Developer Notes
Path Normalization
Critical: Always use `normalizePath()` when working with file paths. Windows uses backslashes, but ignore patterns expect forward slashes. The crawler handles this automatically, but external code must normalize:
```typescript
// `sep` is the platform path separator from node:path
private normalizePath(pathCandidate: string): string {
  return pathCandidate.split(sep).join('/');
}
```
Ignore Pattern Gotchas
The `ignore` library requires trailing slashes for directory patterns. The crawler checks both forms:
```typescript
if (ignoreFilter.ignores(relativePath)) return true;
if (isDirectory && ignoreFilter.ignores(`${relativePath}/`)) return true;
```
CLI Provider Limitations
When using `claude-cli` or `codex-cli`:
- Must be installed and available on `PATH`
- Spawned processes have a 10MB stdout buffer limit (`maxBuffer` in `execFileAsync`)
- Errors may come from either `stderr` or `stdout` depending on the provider
Git Authentication
Tokens are injected into URLs via the `username:password` format:
```typescript
url.username = 'x-access-token';
url.password = this._token;
```
GitHub specifically expects the username `x-access-token` for PAT authentication.
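With the WHATWG `URL` class this amounts to setting the userinfo fields; a self-contained sketch (the function name `withToken` is hypothetical):

```typescript
// Build an authenticated clone URL for GitHub PAT authentication.
function withToken(remoteUrl: string, token: string): string {
  const url = new URL(remoteUrl);
  url.username = 'x-access-token'; // GitHub's expected username for PATs
  url.password = token;
  return url.toString();
}
```

Note that such URLs must never reach logs unredacted, which is why the manager pairs this with credential sanitization.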
Selective Regeneration
The `--target-file` CLI flag enables partial documentation updates:
- Resolve target files to normalized repository paths
- Load existing wiki pages for all areas
- Skip area generation unless it intersects with target files
- Auto-enable incremental docs mode
Important: Unmatched targets trigger warnings but don't fail the build.
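The intersection step can be sketched as follows, returning both the areas to regenerate and the targets that matched nothing, so callers can warn rather than fail (the `Area` shape and function name are illustrative):

```typescript
interface Area { name: string; files: string[]; }

// Decide which areas to regenerate for a set of --target-file paths.
function selectAreas(
  areas: Area[],
  targets: string[],
): { regenerate: string[]; unmatched: string[] } {
  const matched = new Set<string>();
  const regenerate = areas
    .filter(area => {
      const hits = targets.filter(t => area.files.includes(t));
      hits.forEach(h => matched.add(h));
      return hits.length > 0; // skip areas with no overlap
    })
    .map(area => area.name);
  // Unmatched targets are reported, not fatal.
  const unmatched = targets.filter(t => !matched.has(t));
  return { regenerate, unmatched };
}
```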
Usage Examples
Basic Repository Crawl
```typescript
import { RepoCrawler } from './repo-crawler.js';

const crawler = new RepoCrawler();
const tree = await crawler.crawl('/path/to/repo');

const allFiles = crawler.getFilePaths(tree);
console.log(`Found ${allFiles.length} files`);
// tree.children contains the structured hierarchy
```
Using Different LLM Providers
```typescript
// Via environment variables
process.env.LLM_PROVIDER = 'claude-cli';
const config = loadConfig();
const queryFn = createQueryFunction(config, repoPath);

// Execute a query
const query = queryFn({
  prompt: 'Explain this repository structure'
});
for await (const message of query) {
  console.log(message.content);
}
```
Git Repository Operations
```typescript
const manager = new GitRepositoryManager({
  remoteUrl: 'https://github.com/user/repo.git',
  localPath: '/tmp/wiki',
  token: process.env.GITHUB_TOKEN,
  branch: 'master',
  mode: 'incremental'
});

await manager.prepare(); // Clone or update
const status = await manager.status();
if (!status.clean) {
  console.log('Uncommitted changes:', status.uncommittedChanges);
}

const committed = await manager.commit('Update documentation');
if (committed) {
  await manager.push();
}
```
Selective Documentation Update
```bash
# Regenerate docs only for areas touching specific files
node dist/index.js \
  --target-file src/repo-crawler.ts \
  --target-file src/config.ts \
  --depth standard
```
This resolves target files against the repository structure and only regenerates architectural areas that include those files.