OpenHands:SWE Bench - chunhualiao/public-docs GitHub Wiki
Official Docker-based evaluation harness:
https://github.com/All-Hands-AI/OpenHands/tree/main/evaluation/benchmarks/swe_bench
https://github.com/All-Hands-AI/OpenHands/issues/6045 (installation problems reported by developers)
How to run SWE-Bench using OpenHands CodeAct 2.1?
To run SWE-Bench using OpenHands CodeAct 2.1, follow these steps:
1. Install OpenHands and its dependencies:
   - Set up Docker on your system
   - Install Python
   - Clone the OpenHands repository: `git clone https://github.com/All-Hands-AI/OpenHands` [1][5]
2. Configure the environment:
   - Install the required dependencies
   - Set up your chosen LLM provider (e.g., Claude, Gemini, or X-AI) [1]
3. Use OpenHands' remote runtime feature to parallelize evaluation:
   - This allows for efficient execution of unit tests and agent actions [4]
4. Implement your agent logic in the `predict` function:

   ```python
   def predict(inputs: dict):
       # Insert your OpenHands CodeAct 2.1 agent logic here
       return {
           "instance_id": inputs["instance_id"],
           "model_patch": "Your generated patch",
           "model_name_or_path": "OpenHands-CodeAct-2.1",
       }
   ```
5. Run the evaluation using the SWE-Bench dataset:
   - Load the SWE-Bench dataset into LangSmith or use the provided CSV file
   - Use the `evaluate` function to generate predictions for the dataset [6]
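The SWE-Bench harness consumes predictions as records carrying `instance_id`, `model_patch`, and `model_name_or_path`. A minimal sketch that collects `predict` outputs into a JSONL file; the dataset rows here are stubbed rather than loaded from LangSmith or Hugging Face:

```python
import json

def predict(inputs: dict):
    # Stub standing in for the OpenHands CodeAct 2.1 agent
    return {
        "instance_id": inputs["instance_id"],
        "model_patch": "Your generated patch",
        "model_name_or_path": "OpenHands-CodeAct-2.1",
    }

# Stubbed rows; a real run would iterate over actual SWE-Bench instances
dataset = [
    {"instance_id": "astropy__astropy-12907"},
    {"instance_id": "django__django-11099"},
]

# One JSON record per line, the format the evaluation harness reads
with open("all_preds.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(predict(row)) + "\n")
```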
6. Execute the SWE-Bench evaluation in Docker:
   - Set up Docker images for parallel execution
   - Use the `run_instances` function from the SWE-Bench evaluation script to run the predictions [6]
7. Convert the evaluation results to LangSmith feedback format for analysis [6]
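One way to sketch that conversion: map each instance's resolved/unresolved status to a binary feedback score. The report shape below (a dict with `resolved_ids` and `unresolved_ids` lists) is an assumption for illustration, and the actual LangSmith feedback API call is not shown:

```python
# Hypothetical harness report: which instances the agent resolved
report = {
    "resolved_ids": ["django__django-11099"],
    "unresolved_ids": ["astropy__astropy-12907"],
}

# Binary feedback records: score 1 for resolved, 0 for unresolved
feedback = [
    {"key": "swe-bench-resolved", "score": 1, "instance_id": iid}
    for iid in report["resolved_ids"]
] + [
    {"key": "swe-bench-resolved", "score": 0, "instance_id": iid}
    for iid in report["unresolved_ids"]
]
print(len(feedback))  # 2
```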
By following these steps, you can run SWE-Bench using OpenHands CodeAct 2.1 and evaluate its performance on real-world software engineering tasks.
Citations:
- [1] https://www.youtube.com/watch?v=P7jC0L-hjM0
- [2] https://www.youtube.com/watch?v=su067vIv7eQ
- [3] https://app.daily.dev/posts/all-hands-ai-open-sources-openhands-codeact-2-1-a-new-software-development-agent-to-solve-over-50--1elyw5yw5
- [4] http://arxiv.org/pdf/2412.21139.pdf
- [5] https://www.youtube.com/watch?v=PJ8XUfivG20
- [6] https://docs.smith.langchain.com/evaluation/tutorials/swe-benchmark
- [7] https://nlp.elvissaravia.com/p/ai-agents-weekly-codeact-21-copilot
- [8] https://www.researchgate.net/publication/387540461_Training_Software_Engineering_Agents_and_Verifiers_with_SWE-Gym
- [9] https://www.swebench.com