SWE Bench - chunhualiao/public-docs GitHub Wiki
- how to run it using OpenHands
- focus on the Verified Leaderboard

FAQ: https://gist.github.com/chunhualiao/31480c116ba97a15396f23e2568d7131
Summarized version:
Q: How does SWE-bench handle multiple solutions to the same issue?
A: SWE-bench supports diverse solutions by validating correctness through execution-based testing, ensuring task flexibility and robustness without restricting solutions to a specific structure.
Q: How does SWE-bench find suitable repositories, issues, and PRs from GitHub?
A: It uses a systematic pipeline: selects popular Python repositories with open-source licenses, filters PRs resolving issues and modifying tests, and validates them through automated testing, narrowing down from 90,000 PRs to 2,294 task instances.
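As a toy illustration of that filtering step, the sketch below keeps only PRs that both resolve an issue and modify test files. The PR records and field names are made up for the example, not the actual GitHub data SWE-bench scrapes.

```python
# Sketch of the SWE-bench-style filtering idea: keep PRs that resolve an
# issue AND touch test files. The records below are made-up placeholders.
prs = [
    {"number": 101, "linked_issue": 99,
     "changed_files": ["sympy/matrices/matexpr.py",
                       "sympy/matrices/tests/test_matexpr.py"]},
    {"number": 102, "linked_issue": None,
     "changed_files": ["README.rst"]},
]

def is_candidate(pr):
    resolves_issue = pr["linked_issue"] is not None
    modifies_tests = any("test" in path for path in pr["changed_files"])
    return resolves_issue and modifies_tests

candidates = [pr for pr in prs if is_candidate(pr)]
print([pr["number"] for pr in candidates])  # [101]
```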
Q: What does a SWE-bench task instance look like?
A: Each task includes a problem statement, codebase snapshot, fail-to-pass tests, and a reference patch. Example: fixing identity matrix handling in sympy, with the issue details, test, and expected patch provided.
Q: How to use SWE-bench to evaluate an LLM?
A:
- Select a task instance.
- Prepare model input (e.g., issue description, relevant code).
- Prompt the model to generate a patch.
- Apply the patch, run tests, and evaluate results (e.g., task resolution and patch quality).
- Aggregate metrics over multiple tasks for performance analysis (a code sketch of this loop follows below).
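The sketch below outlines that workflow in Python. It assumes the Hugging Face release of SWE-bench Verified (princeton-nlp/SWE-bench_Verified); the helpers generate_patch and apply_patch_and_run_tests are hypothetical placeholders, not part of the official SWE-bench harness, which runs tests in Docker containers.

```python
# Minimal sketch of evaluating an LLM over SWE-bench task instances.
# Dataset fields (instance_id, problem_statement, base_commit) match the
# published SWE-bench dataset; the model/test helpers are placeholders.
from datasets import load_dataset

def build_prompt(task):
    # Combine the issue text with retrieved code context (retrieval not shown).
    return f"Issue:\n{task['problem_statement']}\n\nProduce a unified diff patch."

def generate_patch(prompt):
    # Placeholder: call your LLM of choice here and return a git-style diff.
    return ""

def apply_patch_and_run_tests(task, patch):
    # Placeholder: check out task['base_commit'], apply the patch, and run
    # the task's fail-to-pass tests; return True if they now pass.
    return False

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
resolved = 0
for task in tasks:
    patch = generate_patch(build_prompt(task))
    if apply_patch_and_run_tests(task, patch):
        resolved += 1
print(f"Resolved {resolved}/{len(tasks)} task instances")
```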
Q: Who identifies relevant code from large codebases?
A: SWE-bench uses retrieval techniques (e.g., BM25) to narrow the context, while the model interprets the issue and modifies retrieved files. This balances task realism and computational feasibility.
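As an illustration of sparse retrieval, the sketch below uses the third-party rank_bm25 package (not the exact SWE-bench retrieval code) to rank repository files against an issue description. The file list, file contents, and issue text are made up, and the whitespace tokenization is deliberately naive.

```python
# BM25 retrieval sketch: rank candidate source files against an issue text,
# then keep the top matches as context for the model.
from rank_bm25 import BM25Okapi

documents = {
    "sympy/matrices/expressions/matexpr.py": "class Identity(MatrixExpr): ...",
    "sympy/matrices/dense.py": "def eye(n): ...",
    "sympy/core/numbers.py": "class Integer(Rational): ...",
}

corpus = list(documents.values())
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

issue = "Identity matrix elements return incorrect values"
scores = bm25.get_scores(issue.lower().split())

# Pair each file with its BM25 score and keep the best matches.
ranked = sorted(zip(documents.keys(), scores), key=lambda x: x[1], reverse=True)
for path, score in ranked[:2]:
    print(f"{score:.3f}  {path}")
```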
Q: Do SWE-bench leaderboard submissions use oracle retrieval? Why not?
A: No, because oracle retrieval is unrealistic and unfair. Submissions rely on sparse retrieval methods to simulate real-world challenges, ensuring fair evaluation of models' problem-solving abilities.
https://www.swebench.com/viewer.html
- select "Verified"
- select the submission
- download (Logs)
The JSON file contains an array of objects, where each object represents one code change prediction. Each object has three main fields:
- instance_id: a unique identifier for the task instance, in the format "repository__project-number" (e.g., "astropy__astropy-14365")
- model_patch: the actual code changes in git diff format, showing the modifications made to various files
- model_name_or_path: identifies the AI model used to generate the prediction
The model_patch field contains git-style diffs that show:
- The files that were modified
- The exact changes made (additions and deletions)
- Line numbers and context for the changes
- The changes are in unified diff format (with + for additions and - for deletions)
A small code sketch for listing the modified files in each patch follows below.
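The following sketch uses only the standard library to load a downloaded predictions file (assumed here to be saved as preds.json) and pull the modified file names out of each diff.

```python
# Sketch: list the files touched by each prediction in a downloaded
# predictions file (assumed filename: preds.json).
import json

with open("preds.json") as f:
    predictions = json.load(f)  # an array of prediction objects

for pred in predictions:
    patch = pred.get("model_patch") or ""
    # In unified diffs, each modified file is introduced by a "+++ b/<path>" line.
    files = [
        line[len("+++ b/"):]
        for line in patch.splitlines()
        if line.startswith("+++ b/")
    ]
    print(pred["instance_id"], "->", ", ".join(files) or "(empty patch)")
```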
For example, in the first entry:
- The change is for astropy__astropy-14365
- It modifies files in the Astropy project
- The main change appears to be adding case-insensitive matching to a regular expression pattern (adding (?i) to the regex); a small illustration follows below
- It includes test files to verify the change
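To illustrate what that kind of change does, the (?i) inline flag makes a Python regular expression case-insensitive. The pattern below is made up for the illustration, not the actual Astropy regex.

```python
import re

# Without the inline flag, matching is case-sensitive.
print(re.match(r"fits", "FITS"))      # None

# Prefixing the pattern with (?i) enables case-insensitive matching.
print(re.match(r"(?i)fits", "FITS"))  # <re.Match object; span=(0, 4), match='FITS'>
```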
The JSON file is essentially a collection of these code change predictions, one entry per modification suggestion. The file was originally in JSONL format (one JSON object per line) and has been converted to a proper JSON array, making it easier to process with standard JSON tools and libraries; a conversion sketch follows below.
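A minimal sketch of that JSONL-to-JSON conversion (the file names are assumptions about what the download is called):

```python
# Convert a JSONL predictions file (one JSON object per line) into a JSON array.
# Filenames are placeholders for whatever the leaderboard download is named.
import json

with open("all_preds.jsonl") as src:
    records = [json.loads(line) for line in src if line.strip()]

with open("all_preds.json", "w") as dst:
    json.dump(records, dst, indent=2)

print(f"Wrote {len(records)} predictions")
```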
This kind of dataset is typically used for:
- Training or evaluating code modification models
- Analyzing patterns in code changes
- Benchmarking automated code repair systems
- Studying software evolution and maintenance patterns