WebArena - chunhualiao/public-docs GitHub Wiki
WebArena is a realistic and reproducible web environment designed for building and evaluating autonomous agents. It consists of fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management[1][3]. The environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving[5].
Key features of WebArena include:
- A benchmark with 812 test examples for evaluating the functional correctness of task completions[1].
- Tasks that are diverse, long-horizon, and designed to emulate real-world activities humans routinely perform on the internet[1][5].
- Annotated validation programs that check the functional correctness of each task[3].
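To make the idea of checking functional correctness concrete, the following is a minimal Python sketch of how a per-task validator might judge an agent's final answer rather than the exact sequence of actions it took. It is an illustration under stated assumptions, not WebArena's actual evaluation code: the names `Task`, `exact_match`, `must_include`, and `evaluate`, as well as the two sample tasks and answers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """A hypothetical benchmark task: a natural-language intent plus a validator."""
    intent: str
    validator: Callable[[str], bool]


def exact_match(expected: str) -> Callable[[str], bool]:
    """Build a validator that passes only if the normalized answer equals `expected`."""
    target = expected.strip().lower()
    return lambda answer: answer.strip().lower() == target


def must_include(*phrases: str) -> Callable[[str], bool]:
    """Build a validator that passes only if every phrase appears in the answer."""
    lowered = [p.lower() for p in phrases]
    return lambda answer: all(p in answer.lower() for p in lowered)


def evaluate(tasks: List[Task], answers: List[str]) -> List[bool]:
    """Run each task's validator on the corresponding agent answer."""
    return [task.validator(answer) for task, answer in zip(tasks, answers)]


# Hypothetical tasks and agent answers, invented for illustration only.
tasks = [
    Task("What is the price of the cheapest desk lamp?", exact_match("$12.99")),
    Task("Which repositories did the user star?", must_include("alpha", "beta")),
]
answers = ["$12.99", "The user starred Alpha only."]

results = evaluate(tasks, answers)
print(results)  # [True, False]
```

The second answer fails because it mentions only one of the two required phrases. Judging the outcome in this way lets different action sequences count as correct as long as they reach the same result.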
WebArena aims to address the disconnect between current AI agents, which are primarily created and tested in simplified synthetic environments, and real-world scenarios[5]. The benchmark tasks are conceptually simple for humans but challenging for AI systems, because completing them requires multiple reasoning steps[1].
Experiments with baseline agents, including those using GPT-4, have shown that solving complex tasks in WebArena is challenging. The best GPT-4-based agent achieved only a 14.41% end-to-end task success rate, far below the human performance of 78.24%[1][5]. These results highlight the need for further development of robust agents and show that current state-of-the-art large language models remain far from human-level performance on these real-life tasks[5].
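The end-to-end success rate quoted above is the share of tasks whose outcome passed validation. A short Python sketch of that arithmetic follows. The function name `success_rate` is hypothetical, and the count of 117 passing tasks is an assumption inferred from the reported 14.41% over 812 tasks; the source does not state it. The human figure is taken as reported.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks whose validator passed."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


TOTAL_TASKS = 812          # benchmark size stated in the text
ASSUMED_SUCCESSES = 117    # assumption: inferred from 14.41% of 812, not given in the source
HUMAN_RATE = 78.24         # human success rate as reported in the text

# Build a per-task pass/fail list matching the assumed counts.
outcomes = [True] * ASSUMED_SUCCESSES + [False] * (TOTAL_TASKS - ASSUMED_SUCCESSES)

agent_rate = success_rate(outcomes)
gap = HUMAN_RATE - agent_rate

print(f"agent success rate: {agent_rate:.2f}%")    # agent success rate: 14.41%
print(f"gap to human: {gap:.2f} percentage points")  # gap to human: 63.83 percentage points
```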
Citations:
- [1] https://arxiv.org/html/2307.13854v4
- [2] https://invariantlabs.ai/blog/what-we-learned-from-analyzing-web-agents
- [3] https://www.cmu.edu/flame/research/2024/webarena.html
- [4] https://www.marktechpost.com/2024/02/09/cmu-researchers-introduce-visualwebarena-an-ai-benchmark-designed-to-evaluate-the-performance-of-multimodal-web-agents-on-realistic-and-visually-stimulating-challenges/
- [5] https://arxiv.org/abs/2307.13854
- [6] https://openreview.net/forum?id=oKn9c6ytLx