SWE Bench

https://www.swebench.com/

  • focus on the Verified Leaderboard
(Screenshot of the SWE-bench Verified leaderboard, 2025-01-01)

FAQs

Full FAQ: https://gist.github.com/chunhualiao/31480c116ba97a15396f23e2568d7131

Succinct summary of the Q&A pairs:

Q: How does SWE-bench handle multiple solutions to the same issue?
A: SWE-bench supports diverse solutions by validating correctness through execution-based testing, ensuring task flexibility and robustness without restricting solutions to a specific structure.


Q: How does SWE-bench find suitable repositories, issues, and PRs from GitHub?
A: It uses a systematic pipeline: selects popular Python repositories with open-source licenses, filters PRs resolving issues and modifying tests, and validates them through automated testing, narrowing down from 90,000 PRs to 2,294 task instances.


Q: What does a SWE-bench task instance look like?
A: Each task includes a problem statement, codebase snapshot, fail-to-pass tests, and a reference patch. Example: Fixing identity matrix handling in sympy with provided issue details, test, and expected patch.


Q: How do you use SWE-bench to evaluate an LLM?
A:

  1. Select a task instance.
  2. Prepare model input (e.g., issue description, relevant code).
  3. Prompt the model to generate a patch.
  4. Apply the patch, run tests, and evaluate results (e.g., task resolution and patch quality).
  5. Aggregate metrics over multiple tasks for performance analysis (a minimal end-to-end sketch follows this list).
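Below is a minimal sketch of one iteration of this loop. It assumes the Hugging Face datasets package, the public princeton-nlp/SWE-bench_Verified dataset, and a hypothetical generate_patch() helper standing in for your LLM call; field names such as problem_statement and base_commit follow the published dataset schema, but verify them against the version you download.

```python
# Minimal sketch of one evaluation iteration (not the official harness).
# Assumes: `pip install datasets`; a repository checkout path you supply.
import subprocess
from pathlib import Path
from datasets import load_dataset

# 1. Select a task instance from SWE-bench Verified.
tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = tasks[0]

# 2. Prepare model input: the issue text plus any retrieved code context.
prompt = (
    "Fix the following issue by producing a unified diff.\n\n"
    f"Issue:\n{task['problem_statement']}\n"
)

# 3. Prompt the model. `generate_patch` is a hypothetical stand-in for your LLM call.
def generate_patch(prompt: str) -> str:
    raise NotImplementedError("call your model here and return a unified diff")

model_patch = generate_patch(prompt)

# 4. Apply the patch to a checkout of task["repo"] at task["base_commit"],
#    then run the task's fail-to-pass tests (only patch application is shown here).
repo_dir = Path("/path/to/checkout")  # illustrative path
(repo_dir / "model.patch").write_text(model_patch)
applied = subprocess.run(["git", "apply", "model.patch"], cwd=repo_dir).returncode == 0

# 5. Aggregate over many instances; the leaderboard metric is the percentage of
#    instances whose fail-to-pass tests pass after the patch is applied.
print(task["instance_id"], "patch applied:", applied)
```

The official harness in the princeton-nlp/SWE-bench repository goes further: it builds a per-instance execution environment, runs the instance's fail-to-pass and pass-to-pass tests, and reports whether the issue counts as resolved.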

Q: Who identifies relevant code from large codebases?
A: SWE-bench uses retrieval techniques (e.g., BM25) to narrow the context, while the model interprets the issue and modifies retrieved files. This balances task realism and computational feasibility.
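As an illustration of that retrieval step, here is a minimal sketch using the rank_bm25 package (a stand-in; SWE-bench's own BM25 setup differs in indexing and tokenization details) to rank repository files against an issue description. The path and issue text are placeholders.

```python
# Sketch: BM25 file retrieval for an issue (assumes `pip install rank_bm25`).
from pathlib import Path
from rank_bm25 import BM25Okapi

issue_text = "Identity matrix element access returns the wrong value ..."  # placeholder

# One "document" per Python file in the repository checkout (placeholder path).
files = sorted(Path("/path/to/repo").rglob("*.py"))
corpus = [f.read_text(errors="ignore") for f in files]
tokenized_corpus = [doc.split() for doc in corpus]  # naive whitespace tokenization

bm25 = BM25Okapi(tokenized_corpus)
top_files = bm25.get_top_n(issue_text.split(), files, n=5)
print("Candidate files to include in the model's context:")
for f in top_files:
    print(" ", f)
```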


Q: Do SWE-bench leaderboard submissions use oracle retrieval? Why not?
A: No, because oracle retrieval is unrealistic and unfair. Submissions rely on sparse retrieval methods to simulate real-world challenges, ensuring fair evaluation of models' problem-solving abilities.

How to check submissions?

https://www.swebench.com/viewer.html

  • select "Verified"
  • select a submission
  • download the logs ("Logs")

The downloaded log file contains a JSON array of objects, where each object represents one code-change prediction with the following structure (a short parsing sketch follows the list below):

  1. Each object has three main fields:

    • instance_id: A unique identifier for the task instance, typically in the format "repository__project-number" (e.g., "astropy__astropy-14365")
    • model_patch: Contains the actual code changes in git diff format, showing the modifications made to various files
    • model_name_or_path: Identifies the AI model used to generate the prediction
  2. The model_patch field contains git-style diffs that show:

    • The files that were modified
    • The exact changes made (additions and deletions)
    • Line numbers and context for the changes
    • Unified diff markers (+ for additions, - for deletions)
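A short sketch for inspecting these fields with the standard library, assuming the predictions have already been converted to a JSON array saved as preds.json (file name is illustrative):

```python
# Sketch: inspect a predictions file (a JSON array of prediction objects).
import json

with open("preds.json") as f:  # illustrative file name
    predictions = json.load(f)

for pred in predictions[:3]:
    # In a unified diff, lines starting with "+++ b/" name the modified files.
    modified = [
        line[len("+++ b/"):]
        for line in pred["model_patch"].splitlines()
        if line.startswith("+++ b/")
    ]
    print(pred["instance_id"], pred.get("model_name_or_path"), modified)
```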

For example, in the first entry:

  • The change is for astropy__astropy-14365
  • It modifies files in the Astropy project
  • The main change appears to be adding case-insensitive matching to a regular expression pattern by prepending the (?i) inline flag (illustrated briefly after this list)
  • It includes test files to verify the change
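For readers unfamiliar with the (?i) inline flag: it switches the whole pattern to case-insensitive matching, equivalent to passing re.IGNORECASE. The pattern below is invented for illustration, not taken from the actual astropy patch.

```python
import re

print(re.match(r"abc", "ABC def"))      # None: case-sensitive match fails
print(re.match(r"(?i)abc", "ABC def"))  # matches: (?i) makes the pattern case-insensitive
```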

The JSON file is essentially a collection of these code change predictions, where each entry represents a separate code modification suggestion. The file was originally in JSONL format (one JSON object per line) and has been converted to a proper JSON array format, making it easier to process with standard JSON tools and libraries.
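If you start from the raw JSONL download, the conversion mentioned above takes a few lines of standard-library Python (file names are illustrative):

```python
# Sketch: convert JSONL predictions (one JSON object per line) to a JSON array.
import json

with open("all_preds.jsonl") as fin:      # illustrative input name
    records = [json.loads(line) for line in fin if line.strip()]

with open("preds.json", "w") as fout:     # illustrative output name
    json.dump(records, fout, indent=2)
```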

This kind of dataset is typically used for:

  • Training or evaluating code modification models
  • Analyzing patterns in code changes
  • Benchmarking automated code repair systems
  • Studying software evolution and maintenance patterns