Top Benchmarks for Vision-Capable AI Models in Computer Use
Vision-capable AI models, particularly multimodal agents that read screenshots of GUIs to interact with computers (desktop apps, web browsers, file systems), are evaluated with specialized benchmarks. These focus on tasks like GUI navigation, element grounding, and multi-step workflows in real or simulated environments. The most prominent and comprehensive benchmarks emphasize open-ended, real-world computer-use scenarios. Below, I highlight the top ones, prioritized by scope, recency, and relevance to vision-based interaction.
| Benchmark | Description | Key Tasks | Evaluation Metrics | Leading Model Performance |
|---|---|---|---|---|
| OSWorld | A scalable benchmark for multimodal agents in real computer environments (Ubuntu, Windows, macOS), testing open-ended tasks with arbitrary apps. It supports execution-based evaluation and interactive learning, addressing gaps in GUI grounding and operational knowledge. Related web-focused benchmarks include Mind2Web, WebArena, and VisualWebArena. | 369 tasks across web/desktop apps, OS file I/O, and multi-app workflows (e.g., document editing, browsing). Some evaluations exclude the 8 Google Drive tasks for easier setup, leaving 361. | Execution-based success rate computed by 134 custom evaluation functions for reproducible assessment (see the sketch below the table). | At the benchmark's release, the best model achieved ~12.24% success versus 72.36% for humans; newer agents (e.g., UI-TARS, Claude computer-use variants) improve on this but remain well below human level. |
| UI-Vision | A desktop-centric GUI benchmark for offline evaluation of visual agents in real-world scenarios, providing dense annotations (bounding boxes, UI labels, action trajectories) across 83 apps. It targets underexplored desktop environments beyond web tasks. | Three task types: Element Grounding (locating UI elements), Layout Grounding (spatial understanding), and Action Prediction (predicting clicks, drags, and keyboard inputs). | Well-defined metrics for fine-grained assessment of perception and interaction; focuses on professional software handling. | State-of-the-art models such as UI-TARS-72B struggle with spatial reasoning and complex actions (e.g., drag-and-drop); no model substantially exceeds modest baselines. |
| OSUniverse | A benchmark for advanced GUI-navigation agents, emphasizing multimodal desktop tasks of increasing complexity. It is extensible, with automated validation for progress tracking, and calibrated so that SOTA agents score under 50% while humans achieve near-perfect results. | Tasks range from basic (precision clicking) to complex (multi-step, multi-app workflows requiring dexterity and reasoning). | Automated scoring with <2% error rate; supports manual review for nuanced cases. | No per-model results highlighted; by design, no current SOTA agent exceeds 50%. |
| WebArena | A self-hosted environment for autonomous web agents, extended to vision-based interaction via screenshots. It simulates realistic browser use for functional correctness in open domains. | Tasks in e-commerce, social forums, code development, and content management (e.g., shopping, editing repos). | Functional correctness and task completion rate. | Varies by model; GPT-4V and Claude 3.5 achieve ~20-30% on vision variants, per related evals. |
| GAIA | A general benchmark for AI assistants handling multimodal inputs (text + images/files) in daily and professional tasks, including tool use for computer interactions. | 466 tasks mixing reasoning, science, and personal workflows (e.g., analyzing images for decisions). | Human-annotated success rate on tool invocation and output accuracy. | Top multimodal models like Gemini 2.5 score ~40-50%, with vision aiding complex file handling. |
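
To make the execution-based metrics above concrete, here is a minimal sketch of the evaluation loop these benchmarks share: the agent is run episode by episode against a live environment, and success is judged by a task-specific checker that inspects the resulting system state rather than the action trace. All names (`evaluate`, `env.reset`, `env.step`, `task.check`, `agent.act`) are hypothetical placeholders, not OSWorld's actual API; see its repo for the real harness.

```python
# Minimal sketch of an execution-based evaluation loop. The interfaces
# (env.reset/env.step, task.check, agent.act) are hypothetical placeholders,
# not the actual API of OSWorld or any other benchmark harness.

def evaluate(agent, tasks, env, max_steps=15):
    """Run each task and score it with its execution-based checker."""
    successes = 0
    for task in tasks:
        obs = env.reset(task)              # load task config; obs = screenshot (+ accessibility tree)
        for _ in range(max_steps):
            action = agent.act(obs, task.instruction)  # model grounds the instruction in the screenshot
            obs, done = env.step(action)   # execute the action in the real OS / browser
            if done:
                break
        # The checker inspects the resulting environment state (files, app
        # settings, page content), not the agent's action sequence.
        successes += int(task.check(env))
    return successes / len(tasks)          # reported as the success rate
```

Note that execution-based checks are typically all-or-nothing: partial progress that leaves the final state wrong counts as a failure, which is one reason holistic scores stay low.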
These benchmarks reveal ongoing challenges: even leading models (e.g., Anthropic's Claude, Google's Gemini, OpenAI's GPT-4o) struggle with visual grounding, long-horizon planning, and real-time adaptation, often scoring below 20% on holistic success rates. For broader AI agent evaluations, see AgentBench (which includes OS and web tasks) or PixelHelp (natural-language instructions grounded to mobile UI action sequences). To pick a single "best," OSWorld stands out for its real-environment breadth and active updates. If you're evaluating a specific model, I recommend starting with OSWorld's GitHub repo for hands-on testing.
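
For hands-on testing, the agent side of the loop above can start as small as the sketch below: prompt a vision-capable model with the current screenshot and the task instruction, then parse its one-line reply into a GUI action. The `query_vlm` helper and the three-verb action format are assumptions for illustration, not any benchmark's or vendor's actual interface; substitute your model's API and the action space of the harness you run.

```python
# Sketch of a screenshot-in, action-out agent step. `query_vlm` and the
# action format are hypothetical; replace them with your model's API and
# the action space of the benchmark harness you use.
import base64
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click", "type", or "hotkey"
    x: int = 0       # screen coordinates for clicks
    y: int = 0
    text: str = ""   # payload for "type"/"hotkey"

def query_vlm(prompt: str, image_b64: str) -> str:
    # Placeholder: call your vision-capable model here (SDK or HTTP API)
    # with the prompt and the base64-encoded screenshot.
    raise NotImplementedError

def act(screenshot_png: bytes, instruction: str) -> Action:
    """One agent step: ground the instruction in the screenshot, emit an action."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    prompt = (
        f"Task: {instruction}\n"
        "Reply with exactly one line: `click X Y`, `type TEXT`, or `hotkey KEYS`."
    )
    reply = query_vlm(prompt, image_b64)
    kind, _, rest = reply.strip().partition(" ")
    if kind == "click":
        x_str, y_str = rest.split()
        return Action("click", x=int(x_str), y=int(y_str))
    if kind == "type":
        return Action("type", text=rest)
    return Action("hotkey", text=rest)
```

The grounding step, mapping a phrase like "the Save button" to pixel coordinates, is exactly where the benchmarks above report the largest gaps between current models and humans.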