CodeAct

Motivation

Previous LLM agents

Each action is a single call to an external tool API.

CodeAct 2.1 is the latest version of the AI-powered software development agent within the OpenHands framework. The update brings improvements in benchmark performance, function calling, and directory traversal, which are described in the sections below.

CodeAct idea

The CodeActAgent implements the "CodeAct" principle, which aims to simplify and improve the performance of LLM-based agents by consolidating their actions primarily into a unified code action space.
Instead of relying on numerous distinct tools or complex API calls, the agent often achieves its goals by generating and executing code (typically Bash commands or Python scripts) within its environment.
Each action can be a complex piece of code logic that calls multiple tool API functions.
Fewer actions are needed, because each action is more capable.
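
The contrast between the two action styles can be sketched in Python. This is a minimal illustration under stated assumptions, not OpenHands code: `list_files` and `read_file` are hypothetical tool functions over an in-memory workspace, and `run_tool_action` and `run_code_action` are assumed helpers invented for this example.

```python
import contextlib
import io

# Hypothetical in-memory "workspace" standing in for a real filesystem.
WORKSPACE = {
    "a.py": "import requests\nprint('a')\n",
    "b.py": "print('b')\n",
    "notes.txt": "import requests\n",
}

def list_files():
    """Hypothetical tool: return the file names in the workspace."""
    return sorted(WORKSPACE)

def read_file(path):
    """Hypothetical tool: return the content of one file."""
    return WORKSPACE[path]

TOOLS = {"list_files": list_files, "read_file": read_file}

def run_tool_action(action):
    """Tool-based style: one action is exactly one tool call."""
    return TOOLS[action["tool_name"]](**action["arguments"])

def run_code_action(code):
    """CodeAct style: one action is a code block that may call many tools."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, dict(TOOLS))  # tools are exposed as plain functions
    return buffer.getvalue()

# Tool-based: finding Python files that import requests takes one action
# to list the files plus one action per file to read it.
tool_actions = [{"tool_name": "list_files", "arguments": {}}]
names = run_tool_action(tool_actions[0])
for name in names:
    tool_actions.append({"tool_name": "read_file", "arguments": {"path": name}})

# CodeAct: the same goal fits in a single action with a loop and a condition.
code_action = """
for name in list_files():
    if name.endswith(".py") and "import requests" in read_file(name):
        print(name)
"""
observation = run_code_action(code_action)
print(len(tool_actions), "tool actions vs 1 code action ->", observation.strip())
```

In this toy setup the tool-based route needs four separate actions (one listing plus three reads), while the code action does the filtering in one step. A real agent would run the generated code in a sandbox rather than with a bare `exec`.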

Comparison with examples

The table below compares example actions of a typical tool-based agent with those of the CodeActAgent.

The key difference lies in how the agent interacts with the environment to achieve a goal.

Tool-Based Agents typically rely on a predefined set of specific tools (like functions or APIs) they can call. The LLM's task is to select the correct tool and provide the necessary arguments, often in a structured format like JSON.
CodeActAgent primarily relies on generating executable code (Bash or Python) to perform actions. It leverages the LLM's coding knowledge and the flexibility of programming languages to interact with the environment, often using standard commands or libraries.

Side-by-Side Action Comparison:

| Task / Goal | Typical Tool-Based Agent Action (JSON) | `CodeActAgent` Action (`CodeAct`) |
| --- | --- | --- |
| Read the content of `README.md` | `{ "tool_name": "ReadFileTool", "arguments": { "path": "README.md" } }` | Python: `print(open('README.md', 'r').read())` <br/> or <br/> Bash: `cat README.md` |
| List files in the current directory | `{ "tool_name": "ListDirectoryTool", "arguments": { "path": "." } }` | Bash: `ls -al` |
| Write "Hello World" to `output.txt` | `{ "tool_name": "WriteFileTool", "arguments": { "path": "output.txt", "content": "Hello World" } }` | Python: `with open('output.txt', 'w') as f: f.write('Hello World')` <br/> or <br/> Bash: `echo "Hello World" > output.txt` |
| Find all Python files containing "import requests" | `{ "tool_name": "SearchCodeTool", "arguments": { "query": "import requests", "file_pattern": "*.py" } }` | Bash: `grep -r --include='*.py' 'import requests' .` |
| Install the `requests` Python package | `{ "tool_name": "InstallPackageTool", "arguments": { "package_manager": "pip", "package_name": "requests" } }` | Bash: `pip install requests` |
| Fetch content from a URL | `{ "tool_name": "HTTPRequestTool", "arguments": { "method": "GET", "url": "https://example.com" } }` | Python: `import requests; response = requests.get('https://example.com'); print(response.text)` <br/> or <br/> Bash: `curl https://example.com` |
| Run unit tests | `{ "tool_name": "RunTestsTool", "arguments": { "framework": "pytest" } }` | Bash: `pytest` |

Key Takeaways:

Flexibility vs. Structure: CodeActAgent gains flexibility by leveraging the vast capabilities of code and shell commands. It can potentially perform actions not covered by a fixed toolset and compose complex operations (loops, conditionals) within a single code block. Tool-based agents offer more structure and potentially higher reliability for specific, well-defined tasks, but might be limited if a required tool doesn't exist.
LLM Task: For tool-based agents, the LLM primarily focuses on selecting the right tool and filling parameters. For CodeActAgent, the LLM must generate syntactically correct and functionally effective code to achieve the goal.
Error Handling: CodeActAgent can potentially leverage error messages (like tracebacks from Python or shell errors) returned by the execution environment to self-debug and retry. Tool-based agents might rely on simpler success/failure signals from the tool execution.

The CodeAct approach aims to make LLM agents more powerful and adaptable by treating code execution as the primary interface to the environment.
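
The error-handling takeaway above can be illustrated with a small feedback loop. This is a sketch under stated assumptions: `scripted_model` is a hypothetical stand-in for an LLM that returns fixed code strings, and `execute` and `solve` are illustrative helpers, not part of any real agent framework.

```python
import contextlib
import io
import traceback

def execute(code):
    """Run a code action and return (succeeded, observation text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
        return True, buffer.getvalue()
    except Exception:
        # The full traceback becomes the observation the model sees next.
        return False, buffer.getvalue() + traceback.format_exc()

def scripted_model(history):
    """Hypothetical stand-in for an LLM: fixes its code after seeing a NameError."""
    if history and "NameError" in history[-1][1]:
        return "import math\nprint(math.sqrt(16))"
    return "print(math.sqrt(16))"  # first attempt forgets the import

def solve(model, max_turns=3):
    """Alternate code actions and observations until one action succeeds."""
    history = []  # list of (code, observation) pairs
    for _ in range(max_turns):
        code = model(history)
        ok, observation = execute(code)
        history.append((code, observation))
        if ok:
            break
    return history

history = solve(scripted_model)
for turn, (code, observation) in enumerate(history, start=1):
    print(f"turn {turn}: {observation.strip().splitlines()[-1]}")
```

The first turn fails with a `NameError` traceback, and the second turn succeeds once the missing import is added. A tool-based agent that only receives a success or failure flag would have less information to work with when deciding how to retry.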

Key features

Key Capabilities and Actions:

Code Execution: Can execute arbitrary Linux Bash commands and Python code within a secure, sandboxed environment (often using an interactive Python interpreter simulated via Bash).
File Handling: Can create, read, and edit files within the workspace, essential for software development tasks. Later versions (like v2.1) specifically improved directory traversal capabilities.
Web Browsing: While the core idea focuses on code, CodeActAgent architectures often integrate or work alongside browsing capabilities to gather information from the web.
Natural Language Interaction (Converse): Can communicate with the user in natural language to ask clarifying questions, confirm understanding, or report progress.
Planning and Reasoning: Uses the underlying LLM's capabilities (often guided by techniques like Chain-of-Thought) to reason about the task, break it down into steps, and decide on the next action based on observations and history.
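
As a rough illustration of the code-execution capability, the sketch below runs a Bash command and packages the result as an observation the agent could reason about on its next step. The `run_bash` helper and the shape of the returned dictionary are assumptions made for this example. It uses a plain subprocess and provides no sandboxing, so it is not how OpenHands isolates execution.

```python
import subprocess

def run_bash(command, timeout=10):
    """Run a Bash command and return an observation dict for the agent.

    NOTE: illustrative only. This runs directly on the host and is NOT a sandbox.
    """
    try:
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "output": f"timed out after {timeout}s"}
    # Merge stdout and stderr so error messages reach the agent as well.
    return {"exit_code": result.returncode, "output": result.stdout + result.stderr}

observation = run_bash("echo hello; exit 3")
print(observation)
```

Returning the exit code together with the combined output gives the planning step both a success signal and the text it needs to decide what to do next.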

Performance

Achieved a 53% resolution rate on SWE-Bench Verified, a new state of the art on that benchmark at the time of release[2][5].
Demonstrated a 41.7% success rate on SWE-Bench Lite[2].

Key Enhancements

Advanced Language Model: Powered by Anthropic's Claude-3.5 Sonnet model, improving natural language comprehension and problem-solving abilities[8].
Function Calling: Refined functionality for more precise task execution[5].
Directory Traversal: Significant improvements in navigating complex directory structures[5].
Real-World Application: Capable of autonomously solving actual GitHub issues, moving beyond controlled environments[2].

Impact on Software Development

CodeAct 2.1 aims to streamline the development process by:

Providing intelligent code suggestions
Automating repetitive tasks
Improving code quality
Assisting with debugging and troubleshooting

Open-Source Advantage

As an open-source tool, CodeAct 2.1 allows developers to freely use, improve, and adapt it to their needs, fostering community-driven innovation in AI-assisted software engineering[2].

This release represents a significant leap forward in AI-powered software development tools, combining state-of-the-art performance with the flexibility and accessibility of an open-source framework.

Citations: