# Benchmarking Guide
Moca has a three-tier benchmarking suite that measures individual functions, composed request pipelines, and underlying infrastructure. CI automatically detects performance regressions on every pull request.
## The Three Tiers
### Tier 1 — Critical Hot Path
Benchmarks for functions that execute on every single API request. A regression here affects every user of every tenant.
| What's Tested | Why It Matters |
|---|---|
| MetaType registry lookup (sync.Map, Redis, PostgreSQL fallback) | Called on every document operation. L1 cache hit should be near-instant. |
| Document CRUD (Get, GetList, Insert) | The core read and write operations behind every API endpoint. |
| SQL query builder | Generates parameterized queries for every list view and document fetch. |
| HTTP middleware chain | RequestID, CORS, tenant resolution, auth, rate limiting -- every request traverses this. |
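To make the tier concrete, here is a minimal sketch of what a Tier 1 micro-benchmark looks like: a pre-warmed `sync.Map` lookup measured with allocation reporting. The package, type, and benchmark names are illustrative stand-ins, not Moca's actual registry API.

```go
package bench

import (
	"sync"
	"testing"
)

// metaType is a placeholder for whatever the registry actually caches.
type metaType struct{ Name string }

// BenchmarkRegistryGet_L1Hit measures a lookup that always hits the
// in-process sync.Map cache; the loop body should stay allocation-free.
func BenchmarkRegistryGet_L1Hit(b *testing.B) {
	var cache sync.Map
	cache.Store("User", &metaType{Name: "User"}) // pre-warm the L1 cache

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, ok := cache.Load("User"); !ok {
			b.Fatal("expected an L1 cache hit")
		}
	}
}
```

Running `go test -bench=RegistryGet_L1Hit -benchmem ./...` prints the ns/op, B/op, and allocs/op columns explained in the results section below.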
### Tier 2 — Per-Request Components
Benchmarks for functions called within the hot path but not necessarily on every request.
| What's Tested | Why It Matters |
|---|---|
| Field validation | Type coercion and validation rules per field. Cost scales with field count. |
| Document naming | Pattern-based naming uses PostgreSQL sequences. Contention risk under concurrency. |
| Lifecycle event dispatch | Switch dispatch across 14 lifecycle events during document writes. |
| Hook resolution | Topological sort of hooks by dependency and priority. Cost grows with installed apps. |
| Rate limiting | Redis sliding-window commands. Network-bound. |
| Request/response transformation | Field filtering and aliasing. Cost scales with field and transformer count. |
| Database transactions | Begin, execute, commit/rollback overhead measurement. |
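Several Tier 2 costs scale with an input dimension such as field count, which Go benchmarks usually express with `b.Run` sub-benchmarks. The sketch below illustrates that pattern; `validateFields` is a hypothetical stand-in, not the real validation entry point.

```go
package bench

import (
	"fmt"
	"strconv"
	"testing"
)

// validateFields is a hypothetical stand-in for per-field type coercion and
// validation; its cost grows with the number of fields.
func validateFields(fields map[string]string) error {
	for _, v := range fields {
		if _, err := strconv.Atoi(v); err != nil {
			return err
		}
	}
	return nil
}

// BenchmarkFieldValidation measures validation cost at several field counts
// using sub-benchmarks, mirroring how Tier 2 scaling is reported.
func BenchmarkFieldValidation(b *testing.B) {
	for _, n := range []int{10, 50, 100} {
		fields := make(map[string]string, n)
		for i := 0; i < n; i++ {
			fields[fmt.Sprintf("field_%d", i)] = "42"
		}
		b.Run(fmt.Sprintf("%d_fields", n), func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				if err := validateFields(fields); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```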
### Tier 3 — Infrastructure
Benchmarks that measure the underlying systems in isolation. These establish baselines so that when a Tier 1 or Tier 2 benchmark regresses, you can determine whether the cause is in the application code or the infrastructure layer.
| What's Tested | Why It Matters |
|---|---|
| PostgreSQL round-trip (INSERT, SELECT) | Raw database latency baseline. Helps isolate DB vs app-layer regressions. |
| Redis GET/SET (single, pipeline, parallel) | Raw cache driver latency. Helps isolate Redis vs business-logic regressions. |
| Connection pool under load (1-500 goroutines) | Detects pool exhaustion and lock contention. Key for tuning max_conns. |
| DDL generation (10, 50, 100 fields) | Runs during migrations. Super-linear scaling indicates algorithmic problems. |
| Schema compilation (5, 50, 100 fields) | Runs during app install and cache rebuild. Regression slows cold starts. |
| Document insert concurrency (1-500 goroutines) | Detects bottlenecks in the full write path (naming, validation, transaction, hooks). |
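The goroutine sweeps above are typically written as sub-benchmarks that tune `b.SetParallelism` before `b.RunParallel`. The sketch below uses an in-memory `insertDocument` stand-in; the real Tier 3 benchmarks exercise PostgreSQL and the connection pool.

```go
package bench

import (
	"fmt"
	"sync"
	"testing"
)

// insertDocument is an in-memory stand-in for the full write path (naming,
// validation, transaction, hooks); a single mutex models the shared resource.
var (
	mu     sync.Mutex
	nextID int
	docs   = map[int]string{}
)

func insertDocument(body string) {
	mu.Lock()
	defer mu.Unlock()
	nextID++
	docs[nextID] = body
}

// BenchmarkDocumentInsert_Parallel sweeps parallelism levels to surface lock
// or pool contention. RunParallel starts p*GOMAXPROCS goroutines, so the
// values below are approximate goroutine targets.
func BenchmarkDocumentInsert_Parallel(b *testing.B) {
	for _, p := range []int{1, 10, 100, 500} {
		b.Run(fmt.Sprintf("parallelism_%d", p), func(b *testing.B) {
			docs = map[int]string{} // reset shared state between sub-benchmarks
			b.SetParallelism(p)
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					insertDocument("payload")
				}
			})
		})
	}
}
```

A flat ns/op curve across the sub-benchmarks suggests the shared resource is not the bottleneck; a sharp rise at higher parallelism points to contention worth profiling.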
## How to Run Benchmarks
| Command | What It Does | Docker Required? |
|---|---|---|
| `make bench` | Run all benchmarks that don't need external services (5 iterations) | No |
| `make bench-integration` | Start Docker services, run all benchmarks including DB/Redis (10 iterations) | Yes |
| `make bench-compare` | Run benchmarks and compare against the saved baseline using benchstat | No |
| `make bench-save-baseline` | Run benchmarks and save results as the comparison baseline | No |
| `make bench-profile` | Capture CPU and memory profiles for a specific benchmark | No |
## How to Read Results
A typical benchmark output line looks like:
```
BenchmarkRegistryGet_L1Hit-8    26492102    45.2 ns/op    0 B/op    0 allocs/op
```
| Column | Meaning |
|---|---|
| `BenchmarkRegistryGet_L1Hit-8` | Benchmark name. `-8` means it ran with `GOMAXPROCS=8`. |
| `26492102` | Number of iterations the benchmark ran. More iterations = more statistical confidence. |
| `45.2 ns/op` | Nanoseconds per operation. The primary performance metric. Lower is better. |
| `0 B/op` | Bytes allocated per operation. Tracks memory pressure. Lower is better. |
| `0 allocs/op` | Heap allocations per operation. Each allocation adds GC pressure. Zero is ideal for hot paths. |
### What "good" looks like
- Tier 1 (L1 cache hits): < 200 ns/op, 0 allocs/op
- Tier 1 (database operations): < 5 ms/op
- Tier 2 (CPU-only): < 50 us/op
- Tier 3 (infrastructure): Stable across runs. Used as a baseline, not judged in isolation.
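The zero-allocation target for hot paths can also be verified directly in a plain test with `testing.AllocsPerRun`. The sketch below assumes a hypothetical pre-warmed registry map; the real lookup lives in Moca's registry code.

```go
package bench

import (
	"sync"
	"testing"
)

// registry models a pre-warmed L1 cache.
var registry sync.Map

func init() {
	registry.Store("User", struct{}{})
}

// TestRegistryLookupAllocs asserts that the hot-path lookup stays at
// 0 allocs/op, matching the Tier 1 target above.
func TestRegistryLookupAllocs(t *testing.T) {
	avg := testing.AllocsPerRun(1000, func() {
		if _, ok := registry.Load("User"); !ok {
			t.Fatal("expected a cache hit")
		}
	})
	if avg != 0 {
		t.Fatalf("expected 0 allocs per lookup, got %.2f", avg)
	}
}
```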
### Reading benchstat comparison output
When comparing two runs, benchstat shows:
```
name                 old time/op    new time/op    delta
RegistryGet_L1Hit    45.2ns ± 2%    44.8ns ± 1%       ~     (p=0.421 n=10+10)
DocManagerInsert     1.23ms ± 3%    1.58ms ± 2%    +28.46%  (p=0.000 n=10+10)
```
- `~` means no statistically significant change (good).
- `+28.46%` means a 28% regression (investigate).
- `p=0.000` means the change is statistically significant (not noise).
- `n=10+10` means 10 samples from each run were compared.
## CI Regression Detection
Every pull request that changes code in `pkg/`, `internal/`, or `cmd/` triggers the benchmark workflow:
- Base branch benchmarks run in a git worktree at the PR's base commit (10 iterations)
- PR branch benchmarks run in a separate worktree at the PR's head commit (10 iterations)
- benchstat compares the two runs for statistically significant changes
- A structured PR comment is posted with:
  - Summary table grouped by tier (only changed benchmarks shown)
  - Status icons: :red_circle: regression >= 10%, :yellow_circle: regression 5-10%, :green_circle: improvement
  - Performance budget proximity table (how close critical paths are to their hard limits)
  - Full raw benchstat output in a collapsed section
- The check fails if any benchmark regresses by 10% or more
## Performance Budgets
Four critical benchmarks have hard performance limits enforced as normal tests (run by `make test`, no Docker needed):
| Benchmark | Budget | What It Guards |
|---|---|---|
| `RegistryGet_L1Hit` | 200 ns/op | `sync.Map` cache lookup must stay near-instant |
| `GenerateTableDDL` (10 fields) | 50 us/op | DDL generation must not slow migrations |
| `TransformerChain_Response` | 20 us/op | Response transformation must not add API latency |
| `HookRegistryResolve_10Hooks` | 5 us/op | Hook resolution must stay fast as apps are installed |
If any budget is exceeded, `make test` fails. These checks catch absolute performance violations regardless of relative change: even if each PR's regression is too small to trip the CI threshold, the accumulated slowdown is caught as soon as it crosses a budget.
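As a rough illustration of how a budget can be enforced from an ordinary test, the sketch below runs a benchmark in-process with `testing.Benchmark` and compares `NsPerOp` against the 200 ns limit. It reuses the `BenchmarkRegistryGet_L1Hit` stand-in sketched in the Tier 1 section; Moca's actual budget tests may be structured differently.

```go
package bench

import (
	"testing"
	"time"
)

// registryGetBudget is the hard limit from the table above.
const registryGetBudget = 200 * time.Nanosecond

// TestRegistryGetBudget runs the benchmark in-process and fails if the
// measured ns/op exceeds the budget, so a plain `go test` run catches
// absolute performance violations without Docker.
func TestRegistryGetBudget(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping performance budget check in -short mode")
	}

	// BenchmarkRegistryGet_L1Hit is the hypothetical hot-path benchmark
	// sketched in the Tier 1 section.
	result := testing.Benchmark(BenchmarkRegistryGet_L1Hit)

	if got := time.Duration(result.NsPerOp()); got > registryGetBudget {
		t.Fatalf("RegistryGet_L1Hit over budget: %v/op > %v/op", got, registryGetBudget)
	}
}
```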
## Profiling a Regression
When a benchmark regresses, use profiling to find the cause:
```bash
# Interactive: prompts for benchmark pattern and package
make bench-profile

# Or run directly:
go test -run=^$ -bench=BenchmarkDocManagerInsert -cpuprofile=cpu.prof -memprofile=mem.prof -benchmem ./pkg/document/...

# View CPU profile in browser
go tool pprof -http=:8080 cpu.prof

# View memory profile in browser
go tool pprof -http=:8080 mem.prof
```
In the flame graph, look for:
- Wide bars at the top = functions consuming the most time
- Unexpected function calls in hot paths (e.g., reflection, JSON marshaling where there shouldn't be any)
- Allocation-heavy functions in the memory profile = GC pressure sources
For concurrency issues, use the trace tool:
```bash
go test -run=^$ -bench=BenchmarkDocManagerInsert_Parallel -trace=trace.out ./pkg/document/...
go tool trace trace.out
```
Look for goroutine blocking, mutex contention, and scheduler delays in the trace viewer.