# Benchmarks
## Overview
The benchmark suite measures AI-generated code quality across 14 dimensions. It is project-agnostic and can be applied to any multi-stage build pipeline. Both GitHub Copilot and Claude Code are evaluated against identical input prompts produced by the az prototype build pipeline.
## How Scores Are Generated
Both tools receive identical input prompts produced by the extension's prompt construction pipeline. GitHub Copilot processes these natively during a live build run. The same prompts are extracted from the debug log and submitted verbatim to Claude Code. Each tool's responses are scored independently against 14 benchmarks with 4-5 weighted sub-factors totaling 100 points each. Because both tools see the exact same input, score differences reflect genuine output quality differences.
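The scoring scheme can be sketched as follows. Each benchmark's 4-5 weighted sub-factors have maximum point values totaling 100; the benchmark score is the sum of points earned. The sub-factor names and weights below are illustrative assumptions, not the published rubric:

```python
# Sketch of the per-benchmark scoring described above: 4-5 weighted
# sub-factors whose maximum points total 100. Names and weights here
# are hypothetical; see benchmarks/README.md for the real rubrics.

def score_benchmark(earned: dict[str, int], max_points: dict[str, int]) -> int:
    """Sum earned sub-factor points into one 0-100 benchmark score."""
    assert sum(max_points.values()) == 100, "sub-factor weights must total 100"
    assert all(0 <= earned[f] <= max_points[f] for f in max_points)
    return sum(earned.values())

# Hypothetical B-SEC sub-factors (weights are assumptions).
max_points = {"secrets": 40, "tls": 20, "input-validation": 20, "least-privilege": 20}
earned = {"secrets": 40, "tls": 18, "input-validation": 16, "least-privilege": 14}
print(score_benchmark(earned, max_points))  # 88
```

Because each tool is scored with the same rubric against the same prompt, the two totals are directly comparable per benchmark.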
## 14 Benchmarks
| ID | Name | Description |
|---|---|---|
| B-INST | Instruction Adherence | Does the output implement exactly what was requested? |
| B-CNST | Constraint Compliance | Are NEVER/MUST/CRITICAL directives followed? |
| B-TECH | Technical Correctness | Is the code syntactically valid and deployable? |
| B-SEC | Security Posture | Are security best practices followed? |
| B-OPS | Operational Readiness | Are deploy scripts production-grade? |
| B-DEP | Dependency Hygiene | Are dependencies minimal and correctly versioned? |
| B-SCOPE | Scope Discipline | Does the output stay within requested boundaries? |
| B-QUAL | Code Quality | Is the code well-organized and maintainable? |
| B-OUT | Output Completeness | Are all required interfaces properly defined? |
| B-CONS | Cross-Stage Consistency | Are patterns uniform across all stages? |
| B-DOC | Documentation Quality | Are docs complete, accurate, and actionable? |
| B-REL | Response Reliability | Is the response complete and parseable? |
| B-RBAC | RBAC & Identity | Are identity/role patterns correct per service? |
| B-ANTI | Anti-Pattern Absence | Are known bad patterns absent from output? |
Each benchmark has 4-5 weighted sub-factors. Full scoring rubrics are in `benchmarks/README.md`.
## Testing Workflow
- Run `az prototype build --debug` via GitHub Copilot
- Extract stage prompts and responses from the debug log
- Submit each prompt to Claude Code and save the responses
- Score both response sets against the 14 benchmarks
- Generate reports
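The extraction step above can be sketched in Python. The `=== PROMPT ... ===` delimiter lines below are assumptions about the debug log format, not the actual format (which is documented in `benchmarks/INSTRUCTIONS.md`):

```python
# Hedged sketch of pulling stage prompts out of a debug log.
# The delimiter format here is an illustrative assumption.
import re

def extract_prompts(log_text: str) -> dict[str, str]:
    """Return {stage_name: prompt_text} parsed from a debug log."""
    pattern = re.compile(
        r"=== PROMPT \[(?P<stage>[\w-]+)\] ===\n(?P<body>.*?)\n=== END PROMPT ===",
        re.DOTALL,
    )
    return {m["stage"]: m["body"] for m in pattern.finditer(log_text)}

log = "=== PROMPT [stage-1] ===\nGenerate the Bicep templates.\n=== END PROMPT ==="
print(extract_prompts(log))  # {'stage-1': 'Generate the Bicep templates.'}
```

Each extracted prompt is then submitted verbatim to Claude Code so both tools see byte-identical input.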
See `benchmarks/INSTRUCTIONS.md` for detailed steps, extraction scripts, and copy-paste analysis instructions.
## Reports
| File | Purpose | Updated |
|---|---|---|
| `benchmarks/YYYY-MM-DD-HH-mm-ss.html` | Per-run benchmark report with stage tabs | Every run |
| `benchmarks/overall.html` | Trends dashboard with per-benchmark detail tabs | On instruction only |
| `benchmarks/YYYY-MM-DD_Benchmark_Report.pdf` | PDF report from TEMPLATE.docx with embedded charts | On instruction only |
Individual run reports may be generated at any time for testing. The trends dashboard and PDF are only updated when explicitly instructed.
## PDF Generation
```shell
python scripts/generate_pdf.py
```
This populates `benchmarks/TEMPLATE.docx` with scores, generates 29 matplotlib charts (1 overall trend + 14 factor comparisons + 14 score trends), embeds them, converts the result to PDF via `docx2pdf`, and cleans up the temporary DOCX.
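The chart inventory can be sketched as follows. Only the counts (1 + 14 + 14 = 29) come from the description above; the filename scheme is an assumption:

```python
# Sketch of the 29-chart inventory: one overall trend chart plus a
# factor-comparison and score-trend chart per benchmark.
# Filenames are hypothetical; only the counts come from the docs.
BENCHMARKS = [
    "B-INST", "B-CNST", "B-TECH", "B-SEC", "B-OPS", "B-DEP", "B-SCOPE",
    "B-QUAL", "B-OUT", "B-CONS", "B-DOC", "B-REL", "B-RBAC", "B-ANTI",
]

def chart_filenames() -> list[str]:
    """Return the full chart list: 1 overall + 2 per benchmark = 29."""
    charts = ["overall_trend.png"]
    for b in BENCHMARKS:
        charts.append(f"{b}_factor_comparison.png")
        charts.append(f"{b}_score_trend.png")
    return charts

print(len(chart_filenames()))  # 29
```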
## Score Ratings
| Rating | Range | Action |
|---|---|---|
| Excellent | 90-100 | Monitor for regressions |
| Good | 75-89 | Review specific sub-criteria |
| Acceptable | 60-74 | Prioritize improvements |
| Poor | 40-59 | Investigate root causes |
| Failing | 0-39 | Immediate investigation required |
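The rating bands above map directly to a simple threshold function, sketched here:

```python
# Minimal sketch of the rating bands from the table above.
def rating(score: float) -> str:
    """Map a 0-100 benchmark score to its rating band."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Acceptable"
    if score >= 40:
        return "Poor"
    return "Failing"

print(rating(88))  # Good
```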