Repository Discovery - Chris-Cullins/wiki_bot GitHub Wiki

# Repository Discovery

- **Scope:** Repository discovery supplies the source-of-truth file tree consumed by every downstream wiki workflow, aligning with the mission in `AGENTS.md:1` and the architecture goals in `ARCHITECTURE.md:1`. It determines what Claude can "see," which directly controls documentation accuracy.
- **Primary Workflow:** `src/index.ts:78` loads configuration, resolves the repo root, and runs `RepoCrawler.crawl`. The resulting `FileNode` graph and flattened path list feed selective regeneration checks, area extraction prompts, and markdown generation.
- **Key Data Model:** `FileNode` (`src/repo-crawler.ts:8`) represents each directory or file with normalized POSIX-like paths, enabling deterministic prompt input and template rendering regardless of host OS.
- **Downstream Contract:** Wiki generation expects `RepoCrawler.getFilePaths` (`src/repo-crawler.ts:128`) to deliver a stable ordering of repo-relative paths. Any change to path normalization or ignore rules must maintain compatibility with `WikiGenerator.identifyRelevantFiles` (`src/wiki-generator.ts:206`) and the selective regeneration logic in `src/index.ts:153`.
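The `FileNode` graph and the flattened path list it produces can be pictured with a minimal sketch. The field names below (other than `path`, which the text above names) are assumptions for illustration, not the actual interface in `src/repo-crawler.ts:8`:

```typescript
// Hypothetical mirror of the FileNode shape described above; the real
// interface lives in src/repo-crawler.ts:8 and may differ in detail.
interface FileNode {
  name: string;            // final path segment, e.g. "repo-crawler.ts"
  path: string;            // normalized repo-relative path, e.g. "src/repo-crawler.ts"
  type: 'file' | 'directory';
  children?: FileNode[];   // present only on directories
}

// Flatten a FileNode tree into the stable, repo-relative path list that
// downstream consumers such as identifyRelevantFiles expect.
function flattenPaths(node: FileNode): string[] {
  if (node.type === 'file') return [node.path];
  return (node.children ?? []).flatMap(flattenPaths);
}

const sample: FileNode = {
  name: 'src',
  path: 'src',
  type: 'directory',
  children: [
    { name: 'index.ts', path: 'src/index.ts', type: 'file' },
    { name: 'repo-crawler.ts', path: 'src/repo-crawler.ts', type: 'file' },
  ],
};

console.log(flattenPaths(sample)); // ['src/index.ts', 'src/repo-crawler.ts']
```

Because traversal order is fixed by the tree, the flattened list stays stable between runs, which is exactly the ordering guarantee the downstream contract relies on.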
## RepoCrawler Module

- **Ignore Strategy:** `defaultIgnorePatterns` (`src/repo-crawler.ts:19`) natively filters generated artifacts such as `.wiki/` and respects user-specified rules via `.gitignore` aggregation inside `buildIgnoreFilter` (`src/repo-crawler.ts:81`). This prevents recursive documentation of derived outputs and keeps Claude prompts focused.
- **Traversal Logic:** `crawlDirectory` (`src/repo-crawler.ts:44`) performs depth-first traversal, skipping ignored entries before recursing. Directory nodes are renamed to their final path segment for human-readable tree printing, while the full normalized path is kept in `path`.
- **Path Normalization:** `normalizePath` (`src/repo-crawler.ts:117`) converts platform-specific separators to `/`, guaranteeing prompt portability. Always pass repo-relative paths to avoid leaking host system structure.
- **Public API:** `crawl(repoPath)` boots the ignore filter and returns a rooted `FileNode`; it must be awaited before calling `getFilePaths`. Use `getFilePaths` to generate deterministic arrays for CLI selective mode, architectural area prompts, and "All Files" listings in prompt templates.
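The separator conversion described above is small but load-bearing. A minimal sketch of the idea (the real implementation at `src/repo-crawler.ts:117` may differ):

```typescript
// Illustrative sketch of the path normalization described above: convert
// Windows-style separators to '/' so prompt content is identical across
// host operating systems. Not a copy of src/repo-crawler.ts:117.
function normalizePath(p: string): string {
  return p.split('\\').join('/');
}

console.log(normalizePath('src\\repo-crawler.ts')); // 'src/repo-crawler.ts'
console.log(normalizePath('src/index.ts'));         // unchanged: 'src/index.ts'
```

Feeding only normalized, repo-relative paths into prompts keeps generated pages byte-identical whether the crawl ran on Windows, macOS, or Linux.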
## Configuration Inputs

- **Repo Location:** `loadConfig` (`src/config.ts:36`) accepts `REPO_PATH` and defaults to `process.cwd()`. `src/index.ts:122` resolves this path before crawling, so provide absolute or repo-relative paths via the environment or the CLI.
- **Incremental Modes:** `WIKI_REPO_MODE` and `INCREMENTAL_DOCS`, toggled in `loadConfig` (`src/config.ts:78`), instruct the crawler to reuse prior docs. Selective runs (`src/index.ts:143`) automatically enable incremental docs when targets are found.
- **Prompt Logging:** Enabling `DEBUG` or `PROMPT_LOG_ENABLED` (`src/config.ts:52`) ensures that the repository structure snapshots written to prompt logs mirror crawler output, which aids debugging of mismatched file mappings.
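A hedged sketch of how the environment inputs above might be combined into a config object. The function and field names here are hypothetical; the actual parsing lives in `src/config.ts` and may differ:

```typescript
// Hypothetical reader for the discovery-related environment variables
// described above. Defaults mirror the documented behavior (REPO_PATH
// falls back to process.cwd()); everything else is an assumption.
interface DiscoveryConfig {
  repoPath: string;
  incrementalDocs: boolean;
  promptLogEnabled: boolean;
}

function loadDiscoveryConfig(
  env: Record<string, string | undefined> = process.env
): DiscoveryConfig {
  return {
    repoPath: env.REPO_PATH ?? process.cwd(),
    incrementalDocs: env.INCREMENTAL_DOCS === 'true',
    promptLogEnabled: env.DEBUG === 'true' || env.PROMPT_LOG_ENABLED === 'true',
  };
}

const cfg = loadDiscoveryConfig({ REPO_PATH: '/tmp/sample', INCREMENTAL_DOCS: 'true' });
console.log(cfg.repoPath, cfg.incrementalDocs); // '/tmp/sample' true
```

Passing the environment as a parameter (instead of reading `process.env` directly) makes the parsing trivially unit-testable, which pairs well with the testing hooks discussed later.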
## CLI Integration

- **Argument Parsing:** `parseCliArgs` (`src/index.ts:25`) supports `--target-file` for selective regeneration. `resolveTargetFiles` (`src/index.ts:94`) cross-references requested files against crawler output, so ensure crawler ignore patterns include everything Claude should avoid; otherwise a missing file prevents targeted updates.
- **Selective Runs:** When targets are provided, `src/index.ts:165` switches to incremental doc updates, preloads existing wiki pages, and only regenerates areas whose relevant files intersect the requested set, lowering API cost and build times.
- **Prompt Preparation:** The formatted tree produced via `RepoCrawler.formatRepoStructure` (`src/wiki-generator.ts:273`) becomes the structural context embedded in prompts like `identify-relevant-files.md:1`, so maintain a concise yet informative hierarchy.
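The intersection check that drives selective runs can be sketched as follows. The function name and shape are assumptions for illustration; the real logic lives around `src/index.ts:165`:

```typescript
// Illustrative sketch of the selective-run decision described above: an
// area is regenerated only when its relevant files intersect the
// requested --target-file set. Names are hypothetical.
function shouldRegenerate(areaFiles: string[], targets: string[]): boolean {
  const targetSet = new Set(targets);
  return areaFiles.some((f) => targetSet.has(f));
}

const area = ['src/repo-crawler.ts', 'src/index.ts'];
console.log(shouldRegenerate(area, ['src/repo-crawler.ts'])); // true
console.log(shouldRegenerate(area, ['docs/README.md']));      // false
```

Note that this comparison only works if both sides use the same normalized, repo-relative paths, which is why the downstream contract on `getFilePaths` ordering and normalization matters.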
## Usage Example

```typescript
import { RepoCrawler } from './repo-crawler.js';

async function inspectRepo(root: string) {
  const crawler = new RepoCrawler();
  const tree = await crawler.crawl(root);
  const files = crawler.getFilePaths(tree);
  console.log(`Discovered ${files.length} files`, files.slice(0, 10));
}
```

- Call this helper before constructing wiki prompts or when verifying changes to ignore behavior. Keep the repo path aligned with `loadConfig().repoPath` to avoid mismatched crawls.
## Extensibility Notes

- **Custom Filters:** Extend `defaultIgnorePatterns` cautiously; adding globs like `docs/` may hide files needed for documentation. Prefer augmenting `.gitignore` so repository owners control visibility without code changes.
- **Performance:** The crawler reads each directory sequentially. For very large repos, consider batching `readdir` results or memoizing `shouldIgnore`, but maintain deterministic ordering to avoid prompt deltas between runs.
- **Testing Hooks:** When building unit tests for discovery, stub `fs/promises` operations and inject synthetic `.gitignore` content to validate ignore precedence (`src/repo-crawler.ts:86`). Pair with fixture trees to ensure `normalizePath` handles Windows-style inputs.
- **Future Enhancements:** Planned incremental, diff-based documentation (see the roadmap in `README.md:69`) will require augmenting the crawler to expose file metadata (mtime, size) so the agent can short-circuit unchanged directories.
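The ignore-precedence testing idea above can be exercised without any filesystem stubbing by building the filter from an in-memory `.gitignore` string. The matching here is simplified to exact and prefix rules, an assumption rather than the crawler's actual glob handling at `src/repo-crawler.ts:81`:

```typescript
// Sketch of testing ignore precedence against synthetic .gitignore
// content. Default patterns and .gitignore rules are merged into one
// rule list; matching is deliberately simplified (exact match or
// directory-prefix match), not full gitignore glob semantics.
function buildIgnoreFilter(gitignore: string, defaults: string[]): (p: string) => boolean {
  const rules = [
    ...defaults,
    ...gitignore
      .split('\n')
      .map((l) => l.trim())
      .filter((l) => l.length > 0 && !l.startsWith('#')), // skip blanks and comments
  ];
  return (p: string) =>
    rules.some((r) => p === r || p.startsWith(r.endsWith('/') ? r : r + '/'));
}

const shouldIgnore = buildIgnoreFilter('dist/\n# comment\n.env', ['.wiki/', 'node_modules/']);
console.log(shouldIgnore('.wiki/Home.md'));  // true  (default pattern)
console.log(shouldIgnore('dist/index.js')); // true  (.gitignore rule)
console.log(shouldIgnore('src/index.ts'));  // false
```

Driving the filter with strings like this keeps discovery tests fast and deterministic, and makes it easy to assert that `.wiki/` is always excluded regardless of repository-supplied rules.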
Next steps developers often take after updating discovery logic:

- Run `npm run type-check` to confirm TypeScript signatures remain synchronized with `WikiGenerator` usage.
- Execute `npm run dev -- --target-file <path>` against a small sample repo to validate selective regeneration behavior.