Repository Analysis - Chris-Cullins/wiki_bot GitHub Wiki


Overview

The Repository Analysis area provides the core infrastructure for discovering, traversing, and understanding the structure of code repositories. It serves as the foundation for all documentation generation by crawling file systems, respecting ignore patterns, and building hierarchical representations of project structure. This area also manages the integration with various LLM providers to power intelligent documentation generation.

Primary Responsibilities:

  • Crawl and index repository file structures while respecting .gitignore patterns
  • Build hierarchical tree representations of project files and directories
  • Provide abstraction layer for multiple LLM providers (Agent SDK, Claude CLI, Codex CLI)
  • Manage git repository operations including cloning, status checking, and synchronization
  • Load and validate application configuration from environment variables

Key Components

Repository Crawling (src/repo-crawler.ts)

RepoCrawler class: Traverses file systems and builds structured representations of repositories. Handles ignore pattern filtering and path normalization across platforms.

  • crawl(repoPath: string): Entry point that initiates repository traversal
  • getFilePaths(tree: FileNode): Extracts flat list of all file paths from tree structure
  • FileNode interface: Represents files and directories with hierarchical relationships
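
Concretely, the relationship between the tree and getFilePaths() can be pictured like this; the field names below are illustrative, not the exact FileNode definition:

```typescript
// Illustrative shape of the crawl output (field names are assumptions).
interface FileNode {
  name: string;          // base name, e.g. "repo-crawler.ts"
  path: string;          // path relative to the repo root, "/"-separated
  type: 'file' | 'directory';
  children?: FileNode[]; // present only for directories
}

// Flatten a tree into file paths, as getFilePaths() is described to do.
function getFilePaths(tree: FileNode): string[] {
  if (tree.type === 'file') return [tree.path];
  return (tree.children ?? []).flatMap(getFilePaths);
}

const tree: FileNode = {
  name: '.', path: '.', type: 'directory',
  children: [
    {
      name: 'src', path: 'src', type: 'directory',
      children: [
        { name: 'config.ts', path: 'src/config.ts', type: 'file' },
        { name: 'index.ts', path: 'src/index.ts', type: 'file' },
      ],
    },
  ],
};

console.log(getFilePaths(tree)); // → ['src/config.ts', 'src/index.ts']
```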

LLM Provider Integration (src/query-factory.ts)

createQueryFunction(): Factory that returns the appropriate query implementation based on configuration. Supports three provider modes:

  1. Agent SDK: Direct integration with Anthropic's Agent SDK
  2. Claude CLI: Executes claude CLI with repository context
  3. Codex CLI: Executes codex exec for prompt processing

Each provider returns a standardized Query iterator interface for consistent consumption.
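
A minimal sketch of that consumption pattern, with a stub in place of a real SDK or CLI-backed provider (the type names here are assumptions based on the description above):

```typescript
// Assumed shape of the provider-agnostic query interface.
interface QueryMessage { type: 'assistant'; content: string; }
type Query = AsyncIterable<QueryMessage>;
type QueryFunction = (params: { prompt: string }) => Query;

// A stub provider: real providers would call the Agent SDK or spawn a CLI.
const stubProvider: QueryFunction = async function* ({ prompt }) {
  yield { type: 'assistant', content: `echo: ${prompt}` };
};

// All providers are consumed identically, regardless of backend.
async function collect(fn: QueryFunction, prompt: string): Promise<string[]> {
  const out: string[] = [];
  for await (const msg of fn({ prompt })) out.push(msg.content);
  return out;
}

console.log(await collect(stubProvider, 'hello')); // → ['echo: hello']
```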

Git Repository Management (src/github/git-repository-manager.ts)

GitRepositoryManager class: Handles all git operations with support for multiple operational modes:

  • Fresh mode: Always clone from scratch
  • Incremental mode: Update existing or clone if missing
  • Reuse-or-clone mode: Use an existing repository as-is, cloning only if it is missing

Provides status checking, commit creation, and push operations with credential sanitization.

Configuration Management (src/config.ts)

loadConfig(): Centralizes all application settings loaded from environment variables. Validates required credentials and provides sensible defaults.

Config interface: Type-safe configuration object covering:

  • API credentials and endpoints
  • Repository paths and URLs
  • Documentation generation settings
  • Debugging and logging options
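
As a sketch, a loader of that shape might look like the following. The field names, and the environment variables other than LLM_PROVIDER (which appears in the usage examples), are hypothetical:

```typescript
// Hypothetical slice of the Config shape; the real src/config.ts differs.
interface Config {
  apiKey: string;
  llmProvider: 'agent-sdk' | 'claude-cli' | 'codex-cli';
  repoPath: string;
  wikiRepoUrl?: string;
  debug: boolean;
}

function loadConfig(env: Record<string, string | undefined>): Config {
  const apiKey = env.ANTHROPIC_API_KEY ?? '';
  // Test mode allows mock operation without credentials (assumed flag name).
  if (!apiKey && env.TEST_MODE !== 'true') {
    throw new Error('ANTHROPIC_API_KEY is required outside test mode');
  }
  return {
    apiKey,
    // No validation in this sketch; unknown values pass through.
    llmProvider: (env.LLM_PROVIDER as Config['llmProvider']) ?? 'agent-sdk',
    repoPath: env.REPO_PATH ?? '.',
    wikiRepoUrl: env.WIKI_REPO_URL,
    debug: env.DEBUG === 'true',
  };
}

const cfg = loadConfig({ ANTHROPIC_API_KEY: 'key' });
console.log(cfg.llmProvider); // → 'agent-sdk'
```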

Application Entry Point (src/index.ts)

main(): Orchestrates the complete documentation generation workflow:

  1. Parse CLI arguments and load configuration
  2. Crawl repository structure
  3. Initialize wiki generator with appropriate LLM provider
  4. Generate documentation pages (Home, Architecture, Areas)
  5. Write results to wiki repository if configured

How It Works

Repository Crawling Flow

  1. Initialize ignore filter: Combines default patterns (.git/, node_modules/, etc.) with .gitignore contents
  2. Recursive traversal: Starting from root, iterate through directories and files
  3. Filter evaluation: Each path checked against ignore patterns before inclusion
  4. Path normalization: Convert platform-specific separators to forward slashes
  5. Tree construction: Build hierarchical FileNode structure with parent-child relationships
```
// Example flow through crawlDirectory
dirPath → readdir() → for each entry:
  - Build full path
  - Calculate relative path from root
  - Normalize path separators
  - Check ignore filter
  - Recurse if directory, add if file
```

LLM Provider Abstraction

The query factory pattern allows seamless switching between providers:

```
// All providers implement the same interface
QueryFunction: (params: { prompt: string }) => Query

// A Query yields standardized messages
yield {
  type: 'assistant',
  content: string
}
```

The Claude CLI implementation adds repository context via the --add-dir flag and enforces a system prompt for a consistent output format.

The Codex CLI implementation parses JSON-formatted output, extracting agent_message entries from the streaming response.
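
Assuming the stream is one JSON object per line with a type field (the exact event schema is an assumption), that extraction step can be sketched as:

```typescript
// Pull the text of agent_message events out of a JSON-lines stream.
function extractAgentMessages(stdout: string): string[] {
  const messages: string[] = [];
  for (const line of stdout.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      const event = JSON.parse(trimmed);
      if (event.type === 'agent_message' && typeof event.text === 'string') {
        messages.push(event.text);
      }
    } catch {
      // Non-JSON lines (progress noise, warnings) are skipped.
    }
  }
  return messages;
}

const sample = [
  '{"type":"task_started"}',
  '{"type":"agent_message","text":"## Overview"}',
  'not json',
  '{"type":"agent_message","text":"More detail"}',
].join('\n');

console.log(extractAgentMessages(sample)); // → ['## Overview', 'More detail']
```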

Git Repository State Management

The repository manager enforces clean state transitions:

  • Status checking: Detects uncommitted changes, branch position, and tracking state
  • Safety guards: Prevents updates when uncommitted changes exist
  • Credential handling: Injects tokens into URLs for authenticated operations
  • Sanitization: Redacts credentials from error messages and logs
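
A minimal sketch of that sanitization step, using a regex to drop the userinfo portion of any URL embedded in a message (the real implementation may differ):

```typescript
// Redact "user:token@" userinfo from URLs before a message is logged.
function sanitize(message: string): string {
  return message.replace(/(https?:\/\/)[^/@\s]+@/g, '$1***@');
}

const err =
  'fatal: unable to access https://x-access-token:ghp_secret@github.com/user/repo.git';
console.log(sanitize(err));
// → 'fatal: unable to access https://***@github.com/user/repo.git'
```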

Important Functions/Classes

RepoCrawler.crawlDirectory(dirPath: string, rootPath: string)

Recursive core of repository traversal. Returns FileNode representing directory with all children.

Key logic:

  • Maintains relative paths from rootPath for consistency
  • Handles root directory special case (path = .)
  • Relies on readdir order for a deterministic entry sequence
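
A simplified, self-contained version of this walk over a throwaway directory, without the ignore filtering the real crawler applies:

```typescript
import { mkdtempSync, mkdirSync, writeFileSync, readdirSync, statSync } from 'node:fs';
import { join, relative, sep } from 'node:path';
import { tmpdir } from 'node:os';

// Recursive walk: collect relative, "/"-normalized file paths from rootPath.
function walk(dirPath: string, rootPath: string): string[] {
  const files: string[] = [];
  for (const entry of readdirSync(dirPath)) {
    const full = join(dirPath, entry);
    const rel = relative(rootPath, full).split(sep).join('/');
    if (statSync(full).isDirectory()) {
      files.push(...walk(full, rootPath));
    } else {
      files.push(rel);
    }
  }
  return files;
}

// Build a tiny repo layout in a temp directory and walk it.
const root = mkdtempSync(join(tmpdir(), 'crawl-demo-'));
mkdirSync(join(root, 'src'));
writeFileSync(join(root, 'README.md'), '# demo');
writeFileSync(join(root, 'src', 'index.ts'), 'export {};');

console.log(walk(root, root).sort()); // → ['README.md', 'src/index.ts']
```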

createClaudeCliQuery(repoPath: string)

Returns query function that executes Claude CLI with repository context. Critical for ensuring the LLM has full codebase visibility.

System prompt injection:

```typescript
const CLAUDE_CLI_SYSTEM_PROMPT =
  'You are an expert technical writer generating polished GitHub wiki pages...';
```

This ensures consistent output format across all prompts.

GitRepositoryManager.prepare()

Entry point for repository initialization based on configured mode:

```
switch (this._mode) {
  case 'fresh': clean() then clone()
  case 'incremental': update() or clone()
  case 'reuse-or-clone': clone() only if missing
}
```

loadConfig()

Configuration loader with intelligent defaults:

  • Test mode detection: Allows mock operations without API keys
  • Provider selection: Maps environment variable to typed enum
  • Incremental docs inference: Auto-enables when mode is incremental or reuse-or-clone
  • Path resolution: Converts relative paths to absolute based on repo location
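
Two of these rules can be sketched as follows (the function names are illustrative):

```typescript
type RepoMode = 'fresh' | 'incremental' | 'reuse-or-clone';
type Provider = 'agent-sdk' | 'claude-cli' | 'codex-cli';

// Map a raw environment value to a typed provider, defaulting to the SDK.
function resolveProvider(raw: string | undefined): Provider {
  const known: Provider[] = ['agent-sdk', 'claude-cli', 'codex-cli'];
  return known.includes(raw as Provider) ? (raw as Provider) : 'agent-sdk';
}

// Incremental docs auto-enable for modes that keep prior wiki state,
// unless an explicit setting overrides the inference.
function inferIncrementalDocs(mode: RepoMode, explicit?: boolean): boolean {
  return explicit ?? (mode === 'incremental' || mode === 'reuse-or-clone');
}

console.log(resolveProvider('claude-cli'));         // → 'claude-cli'
console.log(inferIncrementalDocs('reuse-or-clone')); // → true
```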

Developer Notes

Path Normalization

Critical: Always use normalizePath() when working with file paths. Windows uses backslashes but ignore patterns expect forward slashes. The crawler handles this automatically, but external code must normalize:

```typescript
// `sep` is the platform-specific separator from node:path
private normalizePath(pathCandidate: string): string {
  return pathCandidate.split(sep).join('/');
}
```

Ignore Pattern Gotchas

The ignore library requires trailing slashes for directory patterns. The crawler checks both forms:

```typescript
if (ignoreFilter.ignores(relativePath)) return true;
if (isDirectory && ignoreFilter.ignores(`${relativePath}/`)) return true;
```

CLI Provider Limitations

When using claude-cli or codex-cli:

  • Must be installed and available on PATH
  • Spawned processes have 10MB stdout buffer limit (maxBuffer in execFileAsync)
  • Errors may come from either stderr or stdout depending on provider
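
A sketch of how such a CLI might be spawned with the enlarged buffer; runCli is a hypothetical helper, demonstrated with the current Node binary rather than a real provider CLI:

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Node's default maxBuffer (1 MiB) is too small for long generated pages.
async function runCli(bin: string, args: string[]): Promise<string> {
  try {
    const { stdout } = await execFileAsync(bin, args, {
      maxBuffer: 10 * 1024 * 1024, // 10 MB, per the note above
    });
    return stdout;
  } catch (err) {
    // CLI tools report failures on stderr or stdout; surface both.
    const e = err as { stderr?: string; stdout?: string; message: string };
    throw new Error(e.stderr || e.stdout || e.message);
  }
}

// Demo using the running Node executable as a stand-in CLI.
const out = await runCli(process.execPath, ['-e', "console.log('ok')"]);
console.log(out.trim()); // → 'ok'
```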

Git Authentication

Tokens are injected into URLs via username:password format:

```typescript
url.username = 'x-access-token';
url.password = this._token;
```

GitHub specifically expects username x-access-token for PAT authentication.
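
Using the WHATWG URL API the same way, building an authenticated remote URL looks like this (the token value is a placeholder):

```typescript
// Inject a PAT into a remote URL via the URL object's userinfo fields.
function withToken(remoteUrl: string, token: string): string {
  const url = new URL(remoteUrl);
  url.username = 'x-access-token'; // GitHub's expected username for PATs
  url.password = token;
  return url.toString();
}

console.log(withToken('https://github.com/user/repo.git', 'TOKEN'));
// → 'https://x-access-token:TOKEN@github.com/user/repo.git'
```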

Selective Regeneration

The --target-file CLI flag enables partial documentation updates:

  1. Resolve target files to normalized repository paths
  2. Load existing wiki pages for all areas
  3. Skip area generation unless it intersects with target files
  4. Auto-enable incremental docs mode

Important: Unmatched targets trigger warnings but don't fail the build.
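
The intersection check and the unmatched-target warning can be sketched as follows (the names are illustrative, not the actual implementation):

```typescript
// Regenerate an area only when it owns at least one requested target file.
function shouldRegenerate(areaFiles: string[], targets: Set<string>): boolean {
  return areaFiles.some((file) => targets.has(file));
}

// Targets no area covers are reported but do not fail the run.
function unmatchedTargets(areas: string[][], targets: Set<string>): string[] {
  const covered = new Set(areas.flat());
  return [...targets].filter((t) => !covered.has(t));
}

const targets = new Set(['src/repo-crawler.ts', 'docs/missing.md']);
const repoArea = ['src/repo-crawler.ts', 'src/index.ts'];
const gitArea = ['src/github/git-repository-manager.ts'];

console.log(shouldRegenerate(repoArea, targets)); // → true
console.log(shouldRegenerate(gitArea, targets));  // → false
console.log(unmatchedTargets([repoArea, gitArea], targets)); // → ['docs/missing.md']
```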

Usage Examples

Basic Repository Crawl

```typescript
import { RepoCrawler } from './repo-crawler.js';

const crawler = new RepoCrawler();
const tree = await crawler.crawl('/path/to/repo');
const allFiles = crawler.getFilePaths(tree);

console.log(`Found ${allFiles.length} files`);
// tree.children contains structured hierarchy
```

Using Different LLM Providers

```typescript
// Import paths assume the module layout described above
import { loadConfig } from './config.js';
import { createQueryFunction } from './query-factory.js';

// Select a provider via environment variables
process.env.LLM_PROVIDER = 'claude-cli';
const config = loadConfig();
const queryFn = createQueryFunction(config, repoPath);

// Execute a query
const query = queryFn({
  prompt: 'Explain this repository structure'
});

for await (const message of query) {
  console.log(message.content);
}
```

Git Repository Operations

```typescript
import { GitRepositoryManager } from './github/git-repository-manager.js';

const manager = new GitRepositoryManager({
  remoteUrl: 'https://github.com/user/repo.git',
  localPath: '/tmp/wiki',
  token: process.env.GITHUB_TOKEN,
  branch: 'master',
  mode: 'incremental'
});

await manager.prepare(); // Clone or update
const status = await manager.status();

if (!status.clean) {
  console.log('Uncommitted changes:', status.uncommittedChanges);
}

const committed = await manager.commit('Update documentation');
if (committed) {
  await manager.push();
}
```

Selective Documentation Update

```shell
# Regenerate docs only for areas touching specific files
node dist/index.js \
  --target-file src/repo-crawler.ts \
  --target-file src/config.ts \
  --depth standard
```

This resolves target files against the repository structure and only regenerates architectural areas that include those files.