GAIA benchmark

The GAIA benchmark (General AI Assistants) is a comprehensive evaluation framework designed to assess AI systems' proficiency in handling real-world tasks that require a combination of reasoning, multi-modality processing, web browsing, and tool-use capabilities. Introduced in November 2023, GAIA aims to push the boundaries of AI by focusing on tasks that are straightforward for humans but challenging for AI systems.

Key Features of GAIA:

  • Real-World Questions: GAIA comprises 466 human-designed questions that reflect practical scenarios, ranging from everyday tasks to complex scientific inquiries. These questions often require the AI to interpret and integrate information from various sources and formats, including text, images, and spreadsheets.

  • Fundamental Abilities Tested: The benchmark evaluates several core competencies:

    • Reasoning: The ability to process information and draw logical conclusions.
    • Multi-Modality Handling: The capacity to manage and integrate data from multiple formats and sources.
    • Web Browsing: Proficiency in navigating the internet to gather relevant information.
    • Tool Use: Skill in utilizing various tools to perform tasks or solve problems.

  • Performance Disparity: A notable gap exists between human and AI performance on GAIA tasks. Human respondents achieve approximately 92% accuracy, while advanced AI systems like GPT-4 equipped with plugins score around 15%. This contrast highlights the challenges AI faces in tasks that are simple for humans but complex for machines.

  • Evaluation Methodology: GAIA employs an automated, factual evaluation process, requiring answers in the form of strings, numbers, or lists. Each question has a single correct answer, and evaluation is conducted through a quasi-exact match between the model's response and the ground truth (see the sketch after this list).

  • Leaderboard and Ongoing Development: To foster progress and encourage competition, GAIA maintains a leaderboard where AI systems are ranked based on their performance. As of December 2024, H2O.ai's h2oGPTe Agent secured the top position with a score of 65%, outperforming other major AI models.
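
A minimal sketch of what such a quasi-exact match could look like is below. This is an illustrative approximation, not the official GAIA scorer; the benchmark's exact normalization rules are defined in its released evaluation code.

```python
def quasi_exact_match(prediction: str, truth: str) -> bool:
    """Illustrative quasi-exact match: normalize, then compare."""
    def normalize(text: str) -> str:
        return text.strip().lower().rstrip(".")

    # Treat comma-separated answers as lists and compare element-wise.
    if "," in truth:
        return [normalize(p) for p in prediction.split(",")] == \
               [normalize(t) for t in truth.split(",")]

    # Compare numbers by value so "90" and "90.0" both count as correct.
    try:
        return float(normalize(prediction)) == float(normalize(truth))
    except ValueError:
        return normalize(prediction) == normalize(truth)

print(quasi_exact_match("90.0", "90"))                  # True
print(quasi_exact_match("white; 5876", "White; 5876"))  # True
```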

GAIA represents a significant step toward developing AI systems capable of robust, human-like problem-solving in real-world scenarios. By focusing on tasks that are simple for humans but challenging for AI, GAIA provides a valuable benchmark for measuring progress in the field of artificial intelligence.

The GAIA benchmark categorizes its 466 questions into three levels of complexity, each requiring varying degrees of reasoning, tool use, and multi-modal handling. Here are examples for each level:

Level 1: Basic Complexity

Example:

What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?

Expected Answer: 90

Explanation: This question requires retrieving a specific figure from a known database (ClinicalTrials.gov, the NIH trial registry), involving straightforward web browsing and data extraction.
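
As a rough illustration, an agent could answer this programmatically instead of browsing. The sketch below assumes ClinicalTrials.gov's v2 REST API and its response layout; both the endpoint and the JSON field paths should be verified against the current API documentation.

```python
import requests

# Search for the H. pylori / acne vulgaris trial and print enrollment info.
# Endpoint, parameters, and JSON paths are assumptions based on the
# ClinicalTrials.gov v2 API; verify them before relying on this.
resp = requests.get(
    "https://clinicaltrials.gov/api/v2/studies",
    params={"query.cond": "acne vulgaris", "query.term": "H. pylori", "pageSize": 5},
    timeout=30,
)
resp.raise_for_status()

for study in resp.json().get("studies", []):
    protocol = study.get("protocolSection", {})
    nct_id = protocol.get("identificationModule", {}).get("nctId")
    enrollment = protocol.get("designModule", {}).get("enrollmentInfo", {})
    # enrollmentInfo.type distinguishes ACTUAL from ESTIMATED counts; the
    # question asks for the actual count.
    print(nct_id, enrollment.get("type"), enrollment.get("count"))
```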

Level 2: Intermediate Complexity

Example:

If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.

Expected Answer: +4.6

Explanation: This task requires reading nutritional information from an image of an ice cream container, comparing it to the federal standard as reported by Wikipedia, and calculating the percentage-point difference.
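
The final arithmetic step might look like the following. The label values here are hypothetical stand-ins (the real question reads them from an image of the pint); the 10% figure is the US federal minimum milkfat content for ice cream, which is the standard Wikipedia reported in 2020.

```python
FEDERAL_MIN_BUTTERFAT_PCT = 10.0  # US standard of identity for ice cream

fat_grams_per_serving = 21.0  # hypothetical value read from the label image
grams_per_serving = 144.0     # hypothetical serving size in grams

butterfat_pct = fat_grams_per_serving / grams_per_serving * 100
diff = butterfat_pct - FEDERAL_MIN_BUTTERFAT_PCT
print(f"{diff:+.1f}")  # -> +4.6 with these hypothetical inputs
```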

Level 3: Advanced Complexity

Example:

In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon.

Expected Answer: White; 5,876

Explanation: This complex query involves multiple steps: analyzing a specific image to identify the smaller astronaut, determining which NASA Astronaut Group he belonged to, researching each group member's time in space, excluding those who never flew, and formatting the answer according to precise instructions.
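
The last conversion step is simple arithmetic. Assuming the commonly cited Gemini 4 mission duration of 4 days, 1 hour, 56 minutes (Ed White's only spaceflight), the expected figure checks out:

```python
# Convert a mission duration to total minutes. The duration is the commonly
# cited Gemini 4 mission length; confirm against an authoritative source.
days, hours, minutes = 4, 1, 56
total_minutes = days * 24 * 60 + hours * 60 + minutes
print(total_minutes)  # -> 5876
```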

These examples illustrate the escalating complexity across GAIA's levels, highlighting the benchmark's design to assess AI systems' capabilities in handling tasks that range from simple data retrieval to intricate, multi-step reasoning and multi-modal integration.