WebArena - chunhualiao/public-docs GitHub Wiki

WebArena is a realistic and reproducible web environment designed for building and evaluating autonomous agents. It consists of fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management[1][3]. The environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving[5].

Key features of WebArena include:

  1. A benchmark with 812 test examples for evaluating the functional correctness of task completions[1].
  2. Tasks that are diverse, long-horizon, and designed to emulate real-world activities humans routinely perform on the internet[1][5].
  3. Annotated validation programs that automatically check the functional correctness of each task completion[3].
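
The idea of validating functional correctness with per-task programs can be sketched as follows. This is a minimal illustration, not WebArena's actual evaluation code: the `Task` record, the `exact_match` and `must_include` checkers, the state dictionary layout, and the two sample tasks are all hypothetical names and data chosen for the example. It assumes each task carries a small check that inspects the final state produced by an agent and returns pass or fail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of the final state an agent leaves behind,
# e.g. its returned answer or the text of the last page it visited.
State = Dict[str, str]


@dataclass
class Task:
    task_id: int
    intent: str                          # natural-language task description
    validator: Callable[[State], bool]   # program that checks the outcome


def exact_match(key: str, expected: str) -> Callable[[State], bool]:
    """Pass if state[key] equals `expected`, ignoring case and outer whitespace."""
    def check(state: State) -> bool:
        return state.get(key, "").strip().lower() == expected.strip().lower()
    return check


def must_include(key: str, phrases: List[str]) -> Callable[[State], bool]:
    """Pass if every phrase appears in state[key], ignoring case."""
    def check(state: State) -> bool:
        text = state.get(key, "").lower()
        return all(phrase.lower() in text for phrase in phrases)
    return check


def success_rate(tasks: List[Task], final_states: Dict[int, State]) -> float:
    """Fraction of tasks whose validator accepts the agent's final state."""
    passed = sum(
        1 for task in tasks if task.validator(final_states.get(task.task_id, {}))
    )
    return passed / len(tasks)


# Two made-up tasks and the made-up final states of an agent run.
tasks = [
    Task(1, "How many open issues does the repository have?",
         exact_match("answer", "3")),
    Task(2, "Cancel my most recent order.",
         must_include("page_text", ["order", "cancelled"])),
]
final_states = {
    1: {"answer": " 3 "},                        # correct answer
    2: {"page_text": "Your order is pending"},   # order was never cancelled
}

rate = success_rate(tasks, final_states)
print(f"success rate: {rate:.2%}")
```

Checking the final outcome rather than the exact sequence of actions means that any path reaching the correct end state counts as a success, which is what a functional-correctness metric is meant to capture.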

WebArena aims to address the disconnect between current AI agents, which are primarily created and tested in simplified synthetic environments, and real-world scenarios[5]. The benchmark tasks are conceptually simple for humans but challenging for AI systems, because completing them requires multiple reasoning steps[1].

Experiments with baseline agents, including those using GPT-4, have shown that solving complex tasks in WebArena is challenging. The best GPT-4-based agent achieved only a 14.41% end-to-end task success rate, far below the human performance of 78.24%, a gap of almost 64 percentage points[1][5]. These results highlight the need for further development of robust agents and demonstrate that current state-of-the-art large language models remain far from human-level performance on these real-life tasks[5].
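
To make the size of that gap concrete, the arithmetic on the two reported rates can be written out as a short sketch. The function names are hypothetical and the `3 of 4` call uses toy counts purely to show how a success rate is computed; only the 14.41% and 78.24% figures come from the results quoted above.

```python
def success_rate_percent(successes: int, total: int) -> float:
    """End-to-end success rate as a percentage of attempted tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def gap_in_points(human_percent: float, agent_percent: float) -> float:
    """Absolute gap between two success rates, in percentage points."""
    return human_percent - agent_percent


# Reported figures from the text above.
GPT4_AGENT_PERCENT = 14.41
HUMAN_PERCENT = 78.24

gap = gap_in_points(HUMAN_PERCENT, GPT4_AGENT_PERCENT)
ratio = HUMAN_PERCENT / GPT4_AGENT_PERCENT

print(f"gap: {gap:.2f} percentage points")       # gap: 63.83 percentage points
print(f"human rate is {ratio:.1f}x the agent")   # human rate is 5.4x the agent

# Toy counts (not from the benchmark) showing how a rate is derived.
print(success_rate_percent(3, 4))                # 75.0
```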

Citations: