PaperBench

https://www.alphaxiv.org/abs/2504.01848

PaperBench: Evaluating AI’s Ability to Replicate AI Research, April 2025

How is a paper replicated?

The paper "PaperBench: Evaluating AI's Ability to Replicate AI Research" describes a benchmark in which AI agents are tasked with replicating research papers. The "prompts" that guide an agent through this process consist of the research paper itself, a system prompt for the agent scaffold, and detailed task instructions. Here's a breakdown:

  1. The Research Paper to be Replicated:

    • The primary input for the AI agent is the research paper it needs to replicate. This is provided in both PDF and Markdown format, located at /home/paper within the agent's environment (as mentioned in Figure 13 of the provided paper).
    • An addendum.md file is also provided, which contains clarifications from the paper's original authors and specifies parts that might be out of scope (Figure 13).
  2. System Prompts for AI Agents: The paper describes two main agent scaffolding approaches, each with its own system prompt:

    • BasicAgent System Prompt (Figure 10, Appendix F.1): This prompt (displayed on page 28 of the paper) instructs the agent that it is a helpful agent attempting to solve a task using available functions. Key instructions include:
      • Using bash and python tools to create the replication.
      • Actually replicating the paper, not just providing instructions.
      • Executing single commands per tool use.
      • Working "BIT BY BIT" and not trying to do everything in one go.
      • Not stopping until all results are replicated and ensuring reproduce.sh can reproduce all metrics, figures, tables, and findings.
      • Using the paginated file reader tool multiple times to read the entire paper.
    • IterativeAgent System Prompt (Figure 11, Appendix F.2): This prompt (displayed on page 28) is used for the IterativeAgent, which is designed to encourage models to use their full allocated time. It is similar to the BasicAgent prompt but tuned to make the model work piecemeal, and it additionally includes:
      • Emphasis on using available time to refine and optimize the solution.
      • Instruction to create a clean codebase, editing/overwriting files rather than creating new ones when it makes sense.
      • A specific instruction on how to handle newlines in code (write actual newline characters rather than the escaped sequence \n).
    • IterativeAgent Continue Message (Figure 12, Appendix F.2): If the IterativeAgent produces a message with no tool calls, this user message (displayed on page 28) is appended to the message history to prompt further action:
      • "Now given the previous progress made by the agent, take the next step towards replicating the paper."
      • Reminds the agent that it has ample time, should reason about the best next step, use tools to write and execute code, and prioritize the most important parts of the paper (a sketch of this prompt-and-continue loop appears after this list).
  3. Task Instructions (Figures 13 & 14, Appendix F.3): These instructions (detailed on pages 29 and 30 of the paper) are fed to the agent (e.g., via a user message for BasicAgent) and outline the specifics of the replication task:

    • Overall Task (Figure 13):
      • Replicate core contributions of the paper within the allotted time, making prioritization decisions.
      • Partial credit is assigned, weighted by importance.
      • Experiments solely in the appendix are out of scope unless their details are essential for a main-body experiment.
      • Resources include the paper, addendum.md, and potentially a rubric.json.
      • A blacklist.txt specifies resources (like the original codebase) that cannot be used.
      • The submission must be a git repository at /home/submission/ (max 1GB, source code only) with a reproduce.sh script at its root. This script will be run in a fresh Ubuntu 24.04 Docker container with an NVIDIA A10 GPU.
      • A README.md is required to describe the replication attempt.
    • Reproduction Script Toy Example & Additional Notes (Figure 14):
      • Provides a toy example of a paper's claim, a count.py script, a reproduce.sh script, and an explanation of how it would be graded (implementation, execution, result).
      • Notes on available compute (${type_of_processor}), total runtime (${max_time_in_hours} hours), total steps (${max_steps}), API keys (OpenAI, Anthropic, Gemini, OpenRouter), root access, and a reminder to actually reproduce the paper and use all available time unless fully complete.
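
Taken together, the system prompt, task instructions, and continue message form a simple agent loop. Below is a minimal sketch of that loop, assuming a hypothetical llm_step function that returns the model's next message plus any tool calls and a run_tool function that executes a single bash or python command; neither is part of the paper's actual scaffolding code.

```python
# Hypothetical sketch of a BasicAgent/IterativeAgent-style loop.
# `llm_step` and `run_tool` are assumed helpers, not the paper's code.
CONTINUE_MSG = ("Now given the previous progress made by the agent, "
                "take the next step towards replicating the paper.")

def run_agent(system_prompt, task_instructions, llm_step, run_tool, max_steps):
    history = [{"role": "system", "content": system_prompt},
               {"role": "user", "content": task_instructions}]
    for _ in range(max_steps):
        message, tool_calls = llm_step(history)
        history.append({"role": "assistant", "content": message})
        if not tool_calls:
            # IterativeAgent behaviour: append the continue message and keep going
            history.append({"role": "user", "content": CONTINUE_MSG})
            continue
        for call in tool_calls:  # the prompt asks for one command per tool use
            history.append({"role": "tool", "content": run_tool(call)})
    return history
```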

In essence, the AI agent receives the research paper and addendum as the primary content to work on. The system prompts set the agent's persona and high-level methodology, while the task instructions provide the detailed rules, goals, and constraints for the PaperBench evaluation.

One concrete example

Below is a walk-through of how an agent is evaluated in PaperBench, illustrated with a concrete (but slightly simplified) example: replicating the ICML-24 Spotlight paper “BBox-Adapter: Lightweight Adapting for Black-Box Large Language Models.”


1 Task set-up

  1. Inputs the agent receives

    • Paper package – the PDF/Markdown of the target paper plus a short addendum that clarifies any underspecified details (e.g. hidden hyper-parameters).
    • Machine & tools – a fresh Ubuntu 24.04 VM with one NVIDIA A10 GPU and internet access (but a per-paper blacklist blocks the original authors’ repo).
    • Instructions – the agent must build a self-contained git repo in /home/submission and include a one-click driver script called reproduce.sh (a hypothetical pre-flight check on these constraints is sketched after this list).
  2. What the agent is not shown

    • The detailed grading rubric (to prevent gaming)
    • Any existing public code for BBox-Adapter
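
As a concrete illustration of those constraints, here is a hypothetical pre-flight check an agent might run on its own submission before calling "end task". The path, size cap, and required files mirror the task instructions above; this script is not part of the official harness.

```python
from pathlib import Path

SUBMISSION = Path("/home/submission")  # location required by the task instructions

def check_submission() -> list[str]:
    """Return a list of problems with the submission, empty if it looks valid."""
    problems = []
    if not (SUBMISSION / ".git").exists():
        problems.append("submission is not a git repository")
    if not (SUBMISSION / "reproduce.sh").is_file():
        problems.append("reproduce.sh is missing from the repo root")
    if not (SUBMISSION / "README.md").is_file():
        problems.append("README.md describing the attempt is missing")
    size = sum(f.stat().st_size for f in SUBMISSION.rglob("*") if f.is_file())
    if size > 1_000_000_000:
        problems.append(f"submission is {size / 1e9:.1f} GB, over the 1 GB cap")
    return problems

if __name__ == "__main__":
    for problem in check_submission():
        print("WARNING:", problem)
```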

2 Agent’s coding phase

The agent now plans and iterates (often for 12+ hours); a sketch of what step 2-d might look like follows the table:

| Step | Concrete actions for BBox-Adapter |
| --- | --- |
| 2-a | Parse the method section → write adapter.py implementing the noise-aware LoRA layer described in §3 |
| 2-b | Read the datasets section → script prepare_data.py to download the CommonsenseQA split used in the paper |
| 2-c | Create train.py that fine-tunes a GPT-J 6B endpoint through OpenAI's completion API, injecting the adapter weights |
| 2-d | Draft evaluate.py that reproduces the Table 2 F1/EM metrics |
| 2-e | Assemble reproduce.sh to (1) install deps, (2) run data prep, (3) launch fine-tuning, (4) run evaluation and save results.json |
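
To give a flavour of step 2-d, here is a minimal sketch of what an evaluate.py might look like; the metric definitions, file names, and example data are illustrative assumptions, not taken from the BBox-Adapter paper.

```python
import json

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    common = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def evaluate(pairs: list[tuple[str, str]]) -> dict:
    """Average EM and token-level F1 over (prediction, gold) pairs."""
    em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
    f1 = sum(token_f1(p, g) for p, g in pairs) / len(pairs)
    return {"exact_match": em, "f1": f1}

if __name__ == "__main__":
    # Toy predictions standing in for real model outputs
    pairs = [("paris", "Paris"), ("blue whale", "the blue whale")]
    with open("results.json", "w") as f:
        json.dump(evaluate(pairs), f, indent=2)
```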

The repo is committed and the agent calls “end task”.


3 Reproduction phase (automatic)

PaperBench copies the submission to a clean VM, then simply runs:

bash reproduce.sh   # hard-capped at 12 h in the paper’s experiments

This produces:

  • reproduce.log – console output
  • results.json / plots/ – the numbers & figures the author claims in the paper

Running everything in a new VM helps ensure the agent isn’t smuggling in hard-coded results.
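
A minimal sketch of this reproduction step, assuming the harness drives reproduce.sh with a hard wall-clock cap via subprocess (the real harness additionally provisions the Docker container and GPU):

```python
import subprocess

TIME_LIMIT_S = 12 * 60 * 60  # 12 h cap used in the paper's experiments

with open("reproduce.log", "w") as log:
    try:
        proc = subprocess.run(
            ["bash", "reproduce.sh"],
            cwd="/home/submission",
            stdout=log,
            stderr=subprocess.STDOUT,  # interleave stderr into the same log
            timeout=TIME_LIMIT_S,
        )
        print("reproduce.sh exited with code", proc.returncode)
    except subprocess.TimeoutExpired:
        print("reproduce.sh hit the time cap; grading proceeds on whatever was produced")
```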


4 Grading with hierarchical rubrics

Each paper has a tree-shaped rubric written with the original author (BBox-Adapter’s rubric has 422 nodes).

  • Leaf node types

    1. Code Development – “Adapter layer implements the re-parameterisation of Eq. (3)”
    2. Execution – “adapter.py was actually executed during reproduce.sh”
    3. Result Match – “F1↑ when using BBox-Adapter vs. baseline reproduces the +8 pt gain in Table 2”

Leaves are scored pass / fail; weighted averages propagate up the tree (Figure 2 shows the mechanics).
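
A minimal sketch of that roll-up, using an invented three-leaf mini-rubric (the real BBox-Adapter rubric has hundreds of nodes and its weights are set with the authors):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0            # relative weight among siblings
    score: float | None = None     # 0 or 1 for leaves, None for internal nodes
    children: list["RubricNode"] = field(default_factory=list)

def aggregate(node: RubricNode) -> float:
    """Propagate leaf pass/fail scores up the tree as weighted averages."""
    if not node.children:          # leaf: the judge assigned 0 or 1
        return float(node.score)
    total = sum(c.weight for c in node.children)
    return sum(c.weight * aggregate(c) for c in node.children) / total

# Invented mini-rubric for illustration only
root = RubricNode("BBox-Adapter replication", children=[
    RubricNode("Adapter implements Eq. (3)", weight=2, score=1),          # Code Development
    RubricNode("adapter.py ran during reproduce.sh", weight=1, score=1),  # Execution
    RubricNode("Table 2 F1 gain reproduced", weight=1, score=0),          # Result Match
])
print(f"Replication Score: {aggregate(root):.0%}")  # -> 75%
```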


5 LLM-based judge (“SimpleJudge”)

Instead of a human TA spending hours, PaperBench passes each leaf to an LLM judge (o3-mini in the paper) which reads:

  • the paper extract
  • the relevant slice of the submission (top-ranked files)
  • the criterion text

The judge decides 0 / 1 and explains its reasoning; on a held-out benchmark it reaches F1 ≈ 0.83 vs. human labels.
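
A rough sketch of how a single leaf might be posed to such a judge; call_llm is a placeholder for whatever model API the harness uses, and the real SimpleJudge prompt and file-ranking logic are considerably more involved.

```python
from pathlib import Path

def judge_leaf(criterion: str, paper_extract: str, submission_dir: str,
               relevant_files: list[str], call_llm) -> tuple[int, str]:
    """Grade one rubric leaf: returns (0 or 1, judge's explanation)."""
    snippets = "\n\n".join(
        f"--- {name} ---\n{Path(submission_dir, name).read_text()[:4000]}"
        for name in relevant_files
    )
    prompt = (
        "You are grading one rubric criterion of a paper-replication attempt.\n\n"
        f"Paper extract:\n{paper_extract}\n\n"
        f"Relevant submission files:\n{snippets}\n\n"
        f"Criterion: {criterion}\n"
        "Answer PASS or FAIL on the first line, then briefly explain your reasoning."
    )
    reply = call_llm(prompt)
    return (1 if reply.strip().upper().startswith("PASS") else 0, reply)
```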


6 Final score

All leaf scores roll up to a single Replication Score (0–100 %). Suppose our BBox-Adapter replica passes:

  • 80 % of Code-Dev leaves
  • 60 % of Execution leaves
  • 40 % of Result-Match leaves

With the rubric weights, the root might aggregate to ≈ 55 %, much like the worked example in Figure 2; a flat version of this arithmetic is sketched below.
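
This uses invented category weights purely for illustration (in reality the weights live on individual rubric nodes, not whole categories):

```python
pass_rates = {"code_dev": 0.80, "execution": 0.60, "result_match": 0.40}
weights    = {"code_dev": 0.25, "execution": 0.25, "result_match": 0.50}  # invented

score = sum(pass_rates[k] * weights[k] for k in pass_rates)
print(f"Replication Score ≈ {score:.0%}")  # -> 55%
```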


✦ Tiny toy illustration

The paper includes a minimal “strawberry r-count” toy to show the logic: reproduce.sh runs a count.py script; the judge checks the generated output.csv and marks Implementation, Execution, and Result Match as 1 / 1 / 1. A paraphrased sketch follows.
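
A paraphrase of that toy in a few lines (the exact code in Figure 14 may differ):

```python
# count.py: count the letter "r" in "strawberry" and write the result to
# output.csv, which the judge then inspects.
import csv

word, letter = "strawberry", "r"
count = word.count(letter)  # 3

with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerow([word, letter, count])
```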


7 Variants & cost savers

  • PaperBench Code-Dev – grade only Code Development nodes (no GPU run), cutting costs by roughly 85 %.
  • Judge pruning – collapse deep rubric branches to cut judge token usage by roughly 10× with little accuracy loss (Appendix H).

Recap

PaperBench evaluates whether an autonomous agent can read a fresh ML paper, implement it from scratch, run the experiments, and match the results. The pipeline—task ingestion → agent coding → sandboxed reproduction → rubric grading by an LLM judge—yields a quantitative Replication Score that lets researchers compare both agents and models on authentic, end-to-end ML R&D work.