SWE Bench

https://www.swebench.com/

  • focus on the Verified Leaderboard
(Screenshot of the SWE-bench Verified leaderboard, 2025-01-01)

FAQs

Full FAQ: https://gist.github.com/chunhualiao/31480c116ba97a15396f23e2568d7131

Succinct summary of the Q&A pairs:

Q: How does SWE-bench handle multiple solutions to the same issue?
A: SWE-bench supports diverse solutions by validating correctness through execution-based testing, ensuring task flexibility and robustness without restricting solutions to a specific structure.


Q: How does SWE-bench find suitable repositories, issues, and PRs from GitHub?
A: It uses a systematic pipeline: selects popular Python repositories with open-source licenses, filters PRs resolving issues and modifying tests, and validates them through automated testing, narrowing down from 90,000 PRs to 2,294 task instances.


Q: What does a SWE-bench task instance look like?
A: Each task includes a problem statement, codebase snapshot, fail-to-pass tests, and a reference patch. Example: Fixing identity matrix handling in sympy with provided issue details, test, and expected patch.


Q: How do you use SWE-bench to evaluate an LLM?
A:

  1. Select a task instance.
  2. Prepare model input (e.g., issue description, relevant code).
  3. Prompt the model to generate a patch.
  4. Apply the patch, run tests, and evaluate results (e.g., task resolution and patch quality).
  5. Aggregate metrics over multiple tasks for performance analysis (a minimal end-to-end sketch follows this list).
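Below is a minimal sketch of one iteration of this loop. It assumes the Hugging Face datasets package, the public princeton-nlp/SWE-bench_Verified dataset, and a hypothetical generate_patch() helper standing in for your LLM call; field names such as problem_statement and base_commit follow the published dataset schema, but verify them against the version you download.

```python
# Minimal sketch of one evaluation iteration (not the official harness).
# Assumes: `pip install datasets`; a repository checkout path you supply.
import subprocess
from pathlib import Path
from datasets import load_dataset

# 1. Select a task instance from SWE-bench Verified.
tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = tasks[0]

# 2. Prepare model input: the issue text plus any retrieved code context.
prompt = (
    "Fix the following issue by producing a unified diff.\n\n"
    f"Issue:\n{task['problem_statement']}\n"
)

# 3. Prompt the model. `generate_patch` is a hypothetical stand-in for your LLM call.
def generate_patch(prompt: str) -> str:
    raise NotImplementedError("call your model here and return a unified diff")

model_patch = generate_patch(prompt)

# 4. Apply the patch to a checkout of task["repo"] at task["base_commit"],
#    then run the task's fail-to-pass tests (only patch application is shown here).
repo_dir = Path("/path/to/checkout")  # illustrative path
(repo_dir / "model.patch").write_text(model_patch)
applied = subprocess.run(["git", "apply", "model.patch"], cwd=repo_dir).returncode == 0

# 5. Aggregate over many instances; the leaderboard metric is the percentage of
#    instances whose fail-to-pass tests pass after the patch is applied.
print(task["instance_id"], "patch applied:", applied)
```

The official harness in the princeton-nlp/SWE-bench repository goes further: it builds a per-instance execution environment, runs the instance's fail-to-pass and pass-to-pass tests, and reports whether the issue counts as resolved.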

Q: Who identifies relevant code from large codebases?
A: SWE-bench uses retrieval techniques (e.g., BM25) to narrow the context, while the model interprets the issue and modifies retrieved files. This balances task realism and computational feasibility.
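As an illustration of that retrieval step, here is a minimal sketch using the rank_bm25 package (a stand-in; SWE-bench's own BM25 setup differs in indexing and tokenization details) to rank repository files against an issue description. The path and issue text are placeholders.

```python
# Sketch: BM25 file retrieval for an issue (assumes `pip install rank_bm25`).
from pathlib import Path
from rank_bm25 import BM25Okapi

issue_text = "Identity matrix element access returns the wrong value ..."  # placeholder

# One "document" per Python file in the repository checkout (placeholder path).
files = sorted(Path("/path/to/repo").rglob("*.py"))
corpus = [f.read_text(errors="ignore") for f in files]
tokenized_corpus = [doc.split() for doc in corpus]  # naive whitespace tokenization

bm25 = BM25Okapi(tokenized_corpus)
top_files = bm25.get_top_n(issue_text.split(), files, n=5)
print("Candidate files to include in the model's context:")
for f in top_files:
    print(" ", f)
```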


Q: Do SWE-bench leaderboard submissions use oracle retrieval? Why not?
A: No, because oracle retrieval is unrealistic and unfair. Submissions rely on sparse retrieval methods to simulate real-world challenges, ensuring fair evaluation of models' problem-solving abilities.

How to check submissions?

https://www.swebench.com/viewer.html

  • select "Verified"
  • select a submission
  • download the logs ("Logs")

The downloaded log file contains a JSON array of objects, where each object represents one code-change prediction with the following structure (a short parsing sketch follows the list below):

  1. Each object has three main fields:

    • instance_id: A unique identifier for the task instance, typically in the format "repository__project-number" (e.g., "astropy__astropy-14365")
    • model_patch: Contains the actual code changes in git diff format, showing the modifications made to various files
    • model_name_or_path: Identifies the AI model used to generate the prediction
  2. The model_patch field contains git-style diffs that show:

    • The files that were modified
    • The exact changes made (additions and deletions)
    • Line numbers and context for the changes
    • Unified diff markers (+ for additions, - for deletions)
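A short sketch for inspecting these fields with the standard library, assuming the predictions have already been converted to a JSON array saved as preds.json (file name is illustrative):

```python
# Sketch: inspect a predictions file (a JSON array of prediction objects).
import json

with open("preds.json") as f:  # illustrative file name
    predictions = json.load(f)

for pred in predictions[:3]:
    # In a unified diff, lines starting with "+++ b/" name the modified files.
    modified = [
        line[len("+++ b/"):]
        for line in pred["model_patch"].splitlines()
        if line.startswith("+++ b/")
    ]
    print(pred["instance_id"], pred.get("model_name_or_path"), modified)
```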

For example, in the first entry:

  • The change is for astropy__astropy-14365
  • It modifies files in the Astropy project
  • The main change appears to be adding case-insensitive matching to a regular expression pattern by prepending the (?i) inline flag (illustrated briefly after this list)
  • It includes test files to verify the change
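For readers unfamiliar with the (?i) inline flag: it switches the whole pattern to case-insensitive matching, equivalent to passing re.IGNORECASE. The pattern below is invented for illustration, not taken from the actual astropy patch.

```python
import re

print(re.match(r"abc", "ABC def"))      # None: case-sensitive match fails
print(re.match(r"(?i)abc", "ABC def"))  # matches: (?i) makes the pattern case-insensitive
```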

The JSON file is essentially a collection of these code change predictions, where each entry represents a separate code modification suggestion. The file was originally in JSONL format (one JSON object per line) and has been converted to a proper JSON array format, making it easier to process with standard JSON tools and libraries.
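If you start from the raw JSONL download, the conversion mentioned above takes a few lines of standard-library Python (file names are illustrative):

```python
# Sketch: convert JSONL predictions (one JSON object per line) to a JSON array.
import json

with open("all_preds.jsonl") as fin:      # illustrative input name
    records = [json.loads(line) for line in fin if line.strip()]

with open("preds.json", "w") as fout:     # illustrative output name
    json.dump(records, fout, indent=2)
```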

This kind of dataset is typically used for:

  • Training or evaluating code modification models
  • Analyzing patterns in code changes
  • Benchmarking automated code repair systems
  • Studying software evolution and maintenance patterns