# Benchmarks
## Overview
The benchmark suite measures AI-generated code quality across 14 dimensions. It is project-agnostic and can be applied to any multi-stage build pipeline. Both GitHub Copilot and Claude Code are evaluated against identical input prompts produced by the az prototype build pipeline.
## How Scores Are Generated
Both tools receive identical input prompts produced by the extension's prompt construction pipeline. GitHub Copilot processes these natively during a live build run. The same prompts are extracted from the debug log and submitted verbatim to Claude Code. Each tool's responses are scored independently against 14 benchmarks with 4-5 weighted sub-factors totaling 100 points each. Because both tools see the exact same input, score differences reflect genuine output quality differences.
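The scoring scheme can be sketched as follows. Each benchmark's 4-5 weighted sub-factors have maximum point values totaling 100; the benchmark score is the sum of points earned. The sub-factor names and weights below are illustrative assumptions, not the published rubric:

```python
# Sketch of the per-benchmark scoring described above: 4-5 weighted
# sub-factors whose maximum points total 100. Names and weights here
# are hypothetical; see benchmarks/README.md for the real rubrics.

def score_benchmark(earned: dict[str, int], max_points: dict[str, int]) -> int:
    """Sum earned sub-factor points into one 0-100 benchmark score."""
    assert sum(max_points.values()) == 100, "sub-factor weights must total 100"
    assert all(0 <= earned[f] <= max_points[f] for f in max_points)
    return sum(earned.values())

# Hypothetical B-SEC sub-factors (weights are assumptions).
max_points = {"secrets": 40, "tls": 20, "input-validation": 20, "least-privilege": 20}
earned = {"secrets": 40, "tls": 18, "input-validation": 16, "least-privilege": 14}
print(score_benchmark(earned, max_points))  # 88
```

Because each tool is scored with the same rubric against the same prompt, the two totals are directly comparable per benchmark.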
## 14 Benchmarks
| ID | Name | Description |
|---|---|---|
| B-INST | Instruction Adherence | Does the output implement exactly what was requested? |
| B-CNST | Constraint Compliance | Are NEVER/MUST/CRITICAL directives followed? |
| B-TECH | Technical Correctness | Is the code syntactically valid and deployable? |
| B-SEC | Security Posture | Are security best practices followed? |
| B-OPS | Operational Readiness | Are deploy scripts production-grade? |
| B-DEP | Dependency Hygiene | Are dependencies minimal and correctly versioned? |
| B-SCOPE | Scope Discipline | Does the output stay within requested boundaries? |
| B-QUAL | Code Quality | Is the code well-organized and maintainable? |
| B-OUT | Output Completeness | Are all required interfaces properly defined? |
| B-CONS | Cross-Stage Consistency | Are patterns uniform across all stages? |
| B-DOC | Documentation Quality | Are docs complete, accurate, and actionable? |
| B-REL | Response Reliability | Is the response complete and parseable? |
| B-RBAC | RBAC & Identity | Are identity/role patterns correct per service? |
| B-ANTI | Anti-Pattern Absence | Are known bad patterns absent from output? |
Each benchmark has 4-5 weighted sub-factors. Full scoring rubrics are in `benchmarks/README.md`.
## Testing Workflow
- Run `az prototype build --debug` via GitHub Copilot
- Extract stage prompts and responses from the debug log
- Submit each prompt to Claude Code and save the responses
- Score both response sets against the 14 benchmarks
- Generate reports
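The extraction step above can be sketched in Python. The `=== PROMPT ... ===` delimiter lines below are assumptions about the debug log format, not the actual format (which is documented in `benchmarks/INSTRUCTIONS.md`):

```python
# Hedged sketch of pulling stage prompts out of a debug log.
# The delimiter format here is an illustrative assumption.
import re

def extract_prompts(log_text: str) -> dict[str, str]:
    """Return {stage_name: prompt_text} parsed from a debug log."""
    pattern = re.compile(
        r"=== PROMPT \[(?P<stage>[\w-]+)\] ===\n(?P<body>.*?)\n=== END PROMPT ===",
        re.DOTALL,
    )
    return {m["stage"]: m["body"] for m in pattern.finditer(log_text)}

log = "=== PROMPT [stage-1] ===\nGenerate the Bicep templates.\n=== END PROMPT ==="
print(extract_prompts(log))  # {'stage-1': 'Generate the Bicep templates.'}
```

Each extracted prompt is then submitted verbatim to Claude Code so both tools see byte-identical input.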
See `benchmarks/INSTRUCTIONS.md` for detailed steps, extraction scripts, and copy-paste analysis instructions.
## Reports
| File | Purpose | Updated |
|---|---|---|
| `benchmarks/YYYY-MM-DD-HH-mm-ss.html` | Per-run benchmark report with stage tabs | Every run |
| `benchmarks/overall.html` | Trends dashboard with per-benchmark detail tabs | On instruction only |
| `benchmarks/YYYY-MM-DD_Benchmark_Report.pdf` | PDF report from TEMPLATE.docx with embedded charts | On instruction only |
Individual run reports may be generated at any time for testing. The trends dashboard and PDF are only updated when explicitly instructed.
## PDF Generation
```shell
python scripts/generate_pdf.py
```
This populates `benchmarks/TEMPLATE.docx` with scores, generates 29 matplotlib charts (1 overall trend + 14 factor comparisons + 14 score trends), embeds them, converts the result to PDF via `docx2pdf`, and cleans up the temporary DOCX.
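The chart inventory can be sketched as follows. Only the counts (1 + 14 + 14 = 29) come from the description above; the filename scheme is an assumption:

```python
# Sketch of the 29-chart inventory: one overall trend chart plus a
# factor-comparison and score-trend chart per benchmark.
# Filenames are hypothetical; only the counts come from the docs.
BENCHMARKS = [
    "B-INST", "B-CNST", "B-TECH", "B-SEC", "B-OPS", "B-DEP", "B-SCOPE",
    "B-QUAL", "B-OUT", "B-CONS", "B-DOC", "B-REL", "B-RBAC", "B-ANTI",
]

def chart_filenames() -> list[str]:
    """Return the full chart list: 1 overall + 2 per benchmark = 29."""
    charts = ["overall_trend.png"]
    for b in BENCHMARKS:
        charts.append(f"{b}_factor_comparison.png")
        charts.append(f"{b}_score_trend.png")
    return charts

print(len(chart_filenames()))  # 29
```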
## Score Ratings
| Rating | Range | Action |
|---|---|---|
| Excellent | 90-100 | Monitor for regressions |
| Good | 75-89 | Review specific sub-criteria |
| Acceptable | 60-74 | Prioritize improvements |
| Poor | 40-59 | Investigate root causes |
| Failing | 0-39 | Immediate investigation required |
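The rating bands above map directly to a simple threshold function, sketched here:

```python
# Minimal sketch of the rating bands from the table above.
def rating(score: float) -> str:
    """Map a 0-100 benchmark score to its rating band."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Acceptable"
    if score >= 40:
        return "Poor"
    return "Failing"

print(rating(88))  # Good
```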