Repository Discovery - Chris-Cullins/wiki_bot GitHub Wiki

Repository Discovery

  • Scope: Repository discovery supplies the source-of-truth file tree consumed by every downstream wiki workflow, aligning with the mission in AGENTS.md:1 and the architecture goals in ARCHITECTURE.md:1. It determines what Claude can “see,” which directly controls documentation accuracy.
  • Primary Workflow: src/index.ts:78 loads configuration, resolves the repo root, and runs RepoCrawler.crawl. The resulting FileNode graph and flattened path list feed selective regeneration checks, area extraction prompts, and markdown generation.
  • Key Data Model: FileNode (src/repo-crawler.ts:8) represents each directory or file with normalized POSIX-like paths, enabling deterministic prompt input and template rendering regardless of host OS.
  • Downstream Contract: Wiki generation expects RepoCrawler.getFilePaths (src/repo-crawler.ts:128) to deliver a stable ordering of repo-relative paths. Any change to path normalization or ignore rules must maintain compatibility with WikiGenerator.identifyRelevantFiles (src/wiki-generator.ts:206) and selective regeneration logic in src/index.ts:153.
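The FileNode graph and the stable-ordering contract can be pictured with a small standalone sketch. The interface fields and the sorting step below are assumptions drawn from the description above, not a copy of src/repo-crawler.ts:

```typescript
// Hypothetical FileNode shape plus a deterministic flattening pass.
// Field names are assumptions; the real definition lives at
// src/repo-crawler.ts:8.
interface FileNode {
  name: string;
  path: string; // repo-relative, POSIX-style separators
  type: 'file' | 'directory';
  children?: FileNode[];
}

// Flatten the tree into a stable, sorted list of file paths so
// downstream consumers (selective regeneration, prompt templates)
// see identical ordering on every run.
function getFilePaths(node: FileNode): string[] {
  if (node.type === 'file') return [node.path];
  return (node.children ?? []).flatMap(getFilePaths).sort();
}

const tree: FileNode = {
  name: 'repo',
  path: '.',
  type: 'directory',
  children: [
    {
      name: 'src',
      path: 'src',
      type: 'directory',
      children: [{ name: 'index.ts', path: 'src/index.ts', type: 'file' }],
    },
    { name: 'README.md', path: 'README.md', type: 'file' },
  ],
};
console.log(getFilePaths(tree)); // ['README.md', 'src/index.ts']
```

The final sort at the root is what makes the ordering deterministic regardless of how the filesystem enumerates entries.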

RepoCrawler Module

  • Ignore Strategy: defaultIgnorePatterns (src/repo-crawler.ts:19) filters out generated artifacts such as .wiki/ by default and honors user-specified rules by aggregating .gitignore entries inside buildIgnoreFilter (src/repo-crawler.ts:81). This prevents recursive documentation of derived outputs and keeps Claude prompts focused.
  • Traversal Logic: crawlDirectory (src/repo-crawler.ts:44) performs depth-first traversal, skipping ignored entries before recursing. Directory nodes are renamed to their final path segment for human-readable tree printing, while the full normalized path is retained in the node's path field.
  • Path Normalization: normalizePath (src/repo-crawler.ts:117) converts platform-specific separators to /, guaranteeing prompt portability. Always pass repo-relative paths to avoid leaking host system structure.
  • Public API: crawl(repoPath) boots the ignore filter, returns a rooted FileNode, and must be awaited before calling getFilePaths. Use getFilePaths to generate deterministic arrays for CLI selective mode, architectural area prompts, and All Files listings in prompt templates.
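The separator conversion described above can be sketched in a few lines. This is a minimal, self-contained approximation; the real normalizePath at src/repo-crawler.ts:117 may differ in detail:

```typescript
// Convert platform-specific separators to '/' and collapse repeated
// separators, assuming repo-relative input as recommended above.
function normalizePath(p: string): string {
  return p.split(/[\\/]+/).filter(Boolean).join('/');
}

console.log(normalizePath('src\\repo-crawler.ts')); // 'src/repo-crawler.ts'
console.log(normalizePath('src//config.ts'));       // 'src/config.ts'
```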

Configuration Inputs

  • Repo Location: loadConfig (src/config.ts:36) accepts REPO_PATH and defaults to process.cwd(). src/index.ts:122 resolves this path before crawling, so provide absolute or repo-relative paths via environment or CLI.
  • Incremental Modes: WIKI_REPO_MODE and INCREMENTAL_DOCS, toggled in loadConfig (src/config.ts:78), instruct the pipeline to reuse prior docs. Selective runs (src/index.ts:143) automatically enable incremental docs when targets are found.
  • Prompt Logging: Enabling DEBUG or PROMPT_LOG_ENABLED (src/config.ts:52) ensures repository structure snapshots written to prompt logs mirror crawler output. This aids debugging mismatched file mappings.
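The inputs above can be summarized in a hedged sketch. The field names mirror the environment variables mentioned, but the shape of the real loadConfig in src/config.ts is an assumption:

```typescript
interface DiscoveryConfig {
  repoPath: string;
  incrementalDocs: boolean;
  promptLogEnabled: boolean;
}

// Illustrative loader: in the real project, loadConfig defaults
// repoPath to process.cwd(); here the fallback is injected so the
// sketch stays free of Node-specific typings.
function loadDiscoveryConfig(
  env: Record<string, string | undefined>,
  fallbackRepoPath = '.'
): DiscoveryConfig {
  return {
    repoPath: env.REPO_PATH ?? fallbackRepoPath,
    incrementalDocs: env.INCREMENTAL_DOCS === 'true',
    promptLogEnabled: env.DEBUG === 'true' || env.PROMPT_LOG_ENABLED === 'true',
  };
}
```

In application code the call would look like loadDiscoveryConfig(process.env, process.cwd()).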

CLI Integration

  • Argument Parsing: parseCliArgs (src/index.ts:25) supports --target-file for selective regeneration. resolveTargetFiles (src/index.ts:94) cross-references requested files against crawler output, so a target excluded by the crawler's ignore patterns will not resolve and its targeted update is blocked; keep ignore patterns limited to files Claude genuinely should not see.
  • Selective Runs: When targets are provided, src/index.ts:165 switches to incremental doc updates, preloads existing wiki pages, and only regenerates areas whose relevant files intersect the requested set—lowering API cost and build times.
  • Prompt Preparation: The formatted tree produced by RepoCrawler.formatRepoStructure (invoked at src/wiki-generator.ts:273) becomes the structural context embedded in prompts such as identify-relevant-files.md:1, so keep the hierarchy concise yet informative.
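The selective-run check described above reduces to a set-membership test. shouldRegenerateArea is a hypothetical name for the intersection logic at src/index.ts:165:

```typescript
// Regenerate an area only when at least one of its relevant files
// is among the user-requested targets (hypothetical helper name).
function shouldRegenerateArea(
  areaFiles: string[],
  targets: Set<string>
): boolean {
  return areaFiles.some((file) => targets.has(file));
}

const targets = new Set(['src/repo-crawler.ts']);
console.log(shouldRegenerateArea(['src/repo-crawler.ts', 'src/index.ts'], targets)); // true
console.log(shouldRegenerateArea(['src/wiki-generator.ts'], targets));               // false
```

Skipping areas with an empty intersection is what lowers API cost and build times on targeted runs.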

Usage Example

```typescript
import { RepoCrawler } from './repo-crawler.js';

async function inspectRepo(root: string) {
  const crawler = new RepoCrawler();
  // crawl must complete before getFilePaths is called.
  const tree = await crawler.crawl(root);
  const files = crawler.getFilePaths(tree);
  console.log(`Discovered ${files.length} files`, files.slice(0, 10));
}
```
  • Call this helper before constructing wiki prompts or when verifying ignore behavior changes. Keep the repo path aligned with loadConfig().repoPath to avoid mismatched crawls.

Extensibility Notes

  • Custom Filters: Extend defaultIgnorePatterns cautiously; adding globs like docs/ may hide files needed for documentation. Prefer augmenting .gitignore so repository owners control visibility without code changes.
  • Performance: The crawler reads each directory sequentially. For very large repos, consider batching readdir results or memoizing shouldIgnore, but maintain deterministic ordering to avoid prompt deltas between runs.
  • Testing Hooks: When building unit tests for discovery, stub fs/promises operations and inject synthetic .gitignore content to validate ignore precedence (src/repo-crawler.ts:86). Pair with fixture trees to ensure normalizePath handles Windows-style inputs.
  • Future Enhancements: Planned incremental, diff-based documentation (see roadmap in README.md:69) will require augmenting the crawler to expose file metadata (mtime, size) so the agent can short-circuit unchanged directories.

Next steps developers often take after updating discovery logic:

  1. Run npm run type-check to confirm TypeScript signatures remain synchronized with WikiGenerator usage.
  2. Execute npm run dev -- --target-file <path> against a small sample repo to validate selective regeneration behavior.