GDPval
What is in it?
The GDPval benchmark is explicitly designed to cover 1,320 specialized tasks across 44 occupations (including Software Developers) from the top 9 industries contributing to US GDP. The key differentiator is that these are not academic exams; they are real-world work products designed by professionals with an average of 14 years of experience, and each task takes an average of seven hours for a human expert to complete.
Here are a few examples of the type of complex, multi-step tasks included in the GDPval evaluation, categorized by the output artifact:
1. Spreadsheet/Data Analysis Tasks (Financial/Operational Domain)
These require multiple steps of parsing, calculation, filtering, and structured output creation—all highly relevant to a Senior Software Engineer working on business logic or operational systems.
- Task Type (e.g., Resource Utilization Analysis): Given a master project allocation spreadsheet (MarchBudget.xlsx) and utilization data, generate a new Excel document with two tabs. A sketch of this kind of check appears after this list.
  - Deliverable 1: Structure the data to answer: "Are any of the five departments at risk of being over- or underutilized (more than 90% or less than 60%)?"
  - Deliverable 2: Identify projects that exceeded the total allocated hours for the month.
  - Difficulty: Requires deep data literacy, calculation precision, adherence to multiple numerical constraints, and the ability to correctly structure the output file (.xlsx).
- Task Type (e.g., Audit Sampling & Variance Analysis): Given Q2 and Q3 financial data in a spreadsheet, perform two related tasks (both are sketched after this list):
  - Calculate the required sample size for audit testing based on a 90% confidence level and a 10% tolerable error rate, with workings shown in a dedicated tab.
  - Perform a variance analysis and select samples that satisfy specific multi-layered business logic (e.g., metrics with >20% variance OR metrics from a specific list of entities, AND metrics A1 and C1 must always be included).
  - Difficulty: Combines advanced statistical calculation with complex, non-trivial conditional filtering and data governance rules.
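A minimal sketch of the utilization check from the first task, assuming hypothetical column names and values for what MarchBudget.xlsx would supply:

```python
import pandas as pd

# Hypothetical stand-in for data read via pd.read_excel("MarchBudget.xlsx");
# the column names and numbers here are illustrative assumptions.
df = pd.DataFrame({
    "department":      ["Eng", "Ops", "Finance", "HR", "Legal"],
    "allocated_hours": [1000, 800, 600, 400, 300],
    "used_hours":      [950, 470, 580, 390, 150],
})

# Utilization = hours actually used / hours allocated.
df["utilization"] = df["used_hours"] / df["allocated_hours"]

# Flag departments above 90% or below 60% utilization (Deliverable 1).
df["at_risk"] = (df["utilization"] > 0.90) | (df["utilization"] < 0.60)
print(df[df["at_risk"]])

# The real deliverable would be written back out as a two-tab workbook, e.g.:
# with pd.ExcelWriter("analysis.xlsx") as xl:
#     df.to_excel(xl, sheet_name="Utilization", index=False)
```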
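And a sketch of the audit-sampling task. The sample-size step below uses the standard normal-approximation formula for a proportion, which is an assumption on my part (the actual GDPval task may prescribe a different method), and the column names are hypothetical:

```python
import math
import pandas as pd

# Sub-task 1: sample size via n = z^2 * p(1-p) / e^2 (normal approximation),
# using the most conservative assumed proportion p = 0.5.
z = 1.645   # z-score for a 90% confidence level
p = 0.5     # worst-case assumed proportion
e = 0.10    # 10% tolerable error
n = math.ceil(z**2 * p * (1 - p) / e**2)
print(n)  # 68

# Sub-task 2: multi-layered selection logic on hypothetical columns.
df = pd.DataFrame({
    "metric_id":    ["A1", "B1", "C1", "D1", "E1"],
    "entity":       ["HQ", "Sub1", "HQ", "Sub2", "Sub2"],
    "variance_pct": [5.0, 25.0, 12.0, -30.0, 8.0],
})
entity_list = ["Sub1"]  # the "specific list of entities" (hypothetical)

# (>20% variance) OR (entity on the list), AND A1/C1 always included.
mask = (df["variance_pct"].abs() > 20) | (df["entity"].isin(entity_list))
mask |= df["metric_id"].isin(["A1", "C1"])
print(df[mask])
```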
2. Policy/Process Documentation Tasks (DevOps/Compliance Domain)
These are crucial for senior engineers involved in defining team processes, which requires clear, structured writing and understanding complex regulatory/procedural constraints.
- Task Type (e.g., Regulatory SOP Creation): Draft a Standard Operating Procedure (SOP) for change management related to project scope, timelines, budget, and regulatory deliverables.
- Constraint: The SOP must clearly lay out the process for submitting, reviewing, approving, and documenting changes in a way that is traceable and audit-ready. It must detail what triggers a formal review and who owns each part of the process.
- Difficulty: This is a long-horizon task requiring the model to synthesize policy, create a structured document with flow, and ensure compliance with implicit audit requirements.
3. Research and Synthesis Tasks (Domain-Specific Knowledge)
These require models to act as an expert researcher and analyst, combining external data with specific criteria.
- Task Type (e.g., Engineering Site Selection): Investigate potential water sources by pulling and reviewing source water assessment data from a specific state EPA website.
- Constraint: The recommendation must be based on strict screening criteria: well depth between 160 and 200, aquifer must be "sand and gravel," and the well must be active (its description cannot include "abandoned," "inactive," etc.).
- Deliverable: An Excel file with two tabs: one with all extracted data (filterable by all criteria), and a second tab containing only the wells that meet all required criteria.
- Difficulty: Requires external web tool use, reading unstructured data (text descriptions), converting it into structured criteria, and producing a highly specific, filtered data product (see the sketch below).
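A hedged sketch of the screening and output step; the column names are hypothetical stand-ins for fields scraped from the state EPA site:

```python
import pandas as pd

# Hypothetical extract of source water assessment data.
wells = pd.DataFrame({
    "well_id":     ["W-101", "W-102", "W-103", "W-104"],
    "depth":       [175, 150, 190, 165],   # depth range as stated in the task
    "aquifer":     ["sand and gravel", "sand and gravel", "bedrock", "sand and gravel"],
    "description": ["active municipal well", "active", "active", "abandoned 1998"],
})

# Screening criteria from the task statement.
inactive = wells["description"].str.contains("abandoned|inactive", case=False)
eligible = wells[
    wells["depth"].between(160, 200)
    & wells["aquifer"].str.lower().eq("sand and gravel")
    & ~inactive
]

# Deliverable: a workbook with all extracted data plus the filtered subset.
with pd.ExcelWriter("well_screening.xlsx") as xl:
    wells.to_excel(xl, sheet_name="All Wells", index=False)
    eligible.to_excel(xl, sheet_name="Meets Criteria", index=False)
```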
Summary of Difficulty
At a 70.9% win rate, GPT-5.2 is performing exceptionally well on tasks that are multi-step, require tool use (web search or a code interpreter), demand strict adherence to formatting and constraints, and take an experienced professional many hours to complete.
- It's not: simple LeetCode problems or MMLU trivia.
- It is: tasks that combine elements of Project Management, Financial Analysis, Technical Writing, and Regulatory Compliance—the exact type of complex, multi-faceted knowledge work that distinguishes a senior professional.
The GDPval paper's authors explain the motivation behind the benchmark and the task-creation process in the video "GDPval: An AI benchmark for real-world tasks."
Human expert baseline
What would the human expert baseline number be for GDPval?
This question defines the entire context of the GDPval benchmark.
The GDPval score is not an absolute score (like a test percentage where 100% is perfect). Instead, it is a Win Rate metric measured against a human expert baseline.
Therefore, the human expert baseline number for GDPval is approximately 50%.
Here's why:
1. The Metric: Head-to-Head Win Rate
- What it measures: The GDPval benchmark's headline metric is the percentage of tasks where the AI's deliverable is judged by another domain expert to be better than or equal to the human expert's deliverable.
- The Baseline: If you were to pit two human experts of similar, high quality against each other in a blind comparison, you would expect each to be judged "better than or equal to" the other about 50% of the time. This establishes the human expert's performance as the center point of the comparison (the toy simulation below illustrates this).
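A toy simulation of that 50% baseline, assuming each deliverable has a latent quality score drawn from the same distribution and the grader prefers the higher one:

```python
import random

random.seed(0)
trials = 100_000
wins = 0
for _ in range(trials):
    # Latent quality of expert A's and expert B's deliverables,
    # drawn from the same distribution (equally skilled experts).
    a = random.gauss(0, 1)
    b = random.gauss(0, 1)
    if a >= b:  # A judged "better than or equal to" B
        wins += 1

print(wins / trials)  # ~0.50
```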
2. Interpretation of the Scores
| Score | Interpretation |
|---|---|
| 50% | Human Parity: The model's output quality is indistinguishable from the average human expert on these tasks. |
| < 50% | Sub-Human Expert: The human expert's work is preferred more often than the model's. |
| > 50% | Super-Human Expert: The model's work is preferred more often than the human expert's. |
Conclusion on the 70.9%
The 70.9% score for GPT-5.2 Thinking is thus highly significant: it means the model's output is judged to be of equal or higher quality than a real-world human professional's roughly 7 times out of 10.
This is well beyond parity and suggests that, in the tasks measured by GDPval, the AI model has demonstrably surpassed the quality level of experienced industry experts.
SOTA 12/11/2025
In the benchmark comparison chart, the number in the last row (GDPval) is actually 70.9% (not 70.0%), and it is highly significant for several reasons.
Here is an analysis of why that specific data point stands out:
1. Massive Generation-over-Generation Leap
The most striking aspect of the "GDPval" row is the delta between the two OpenAI models.
- GPT-5.1 Thinking: Scored 38.8%
- GPT-5.2 Thinking: Scored 70.9%
This represents a roughly 83% relative increase between version 5.1 and 5.2 (the arithmetic is below). In the context of AI benchmarks, typical generation-over-generation gains are measured in single-digit percentage points; a jump of over 30 percentage points suggests a fundamental breakthrough in capability, likely related to the "Thinking" (reasoning) aspect of the model applied to complex tasks.
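The relative gain, computed from the two scores above:

$$\frac{70.9 - 38.8}{38.8} = \frac{32.1}{38.8} \approx 0.83$$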
2. Competitive Dominance
While the other models in the chart are competitive in other areas (for example, Gemini 3 Pro is very close on GPQA Diamond and wins on FrontierMath), the gap in GDPval is exceptionally wide:
- Claude Opus 4.5: 59.6%
- Gemini 3 Pro: 53.5%
GPT-5.2 is beating the nearest competitor by over 11 percentage points. In high-level benchmarks, a lead this large indicates a distinct "moat" or advantage in that specific domain.
3. The Nature of the Benchmark ("Knowledge Work")
The benchmark is labeled "GDPval (Knowledge work tasks)."
- Significance: This implies the benchmark measures the AI's ability to perform economically valuable work (tasks that contribute to GDP).
- Threshold of Utility: A score of 38.8% (GPT-5.1) suggests a model that is helpful but requires heavy human supervision. Crossing the 70% mark means the model's deliverables are judged at least as good as an expert's on roughly seven in ten tasks, a level of reliability often viewed as a tipping point for widespread economic adoption.
Summary
The 70.9% is significant because it represents a step-change rather than an incremental improvement. It suggests that while previous models struggled with real-world economic tasks (scoring below 40-50%), this iteration has potentially "solved" a large portion of them, far outpacing both its predecessor and its competitors.