GAIA - chunhualiao/public-docs GitHub Wiki


The GAIA (General AI Assistants) benchmark consists of 466 questions designed to evaluate next-generation LLMs with augmented capabilities[1][3]. These tasks assess fundamental abilities such as:

  1. Reasoning
  2. Multi-modality handling
  3. Web browsing
  4. Tool-use proficiency

The questions in GAIA are conceptually simple for humans but challenging for AI systems[3]. They simulate real-world applications and require multiple reasoning steps to answer[9]. Some key features of GAIA tasks include:

  • Text-based questions, sometimes accompanied by files like images or spreadsheets[6]
  • Questions that often require the use of tools such as web browsers and code interpreters[9]
  • Tasks that may involve handling multi-modal inputs (e.g., images, videos, Excel spreadsheets)[9]
  • Questions designed to have unique, factual answers, allowing for simple and robust automatic evaluation[6]
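Because each GAIA task has a single factual ground-truth answer, automatic evaluation can reduce to a normalized exact-match check. The sketch below illustrates this idea; the function names and normalization rules are hypothetical, not the official GAIA scorer.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, trim, and strip punctuation so comparison is robust
    to superficial formatting differences (illustrative rules only)."""
    answer = answer.strip().lower()
    answer = re.sub(r"[^\w\s.]", "", answer)  # drop punctuation, keep decimals
    return re.sub(r"\s+", " ", answer)

def score(prediction: str, ground_truth: str) -> bool:
    """True when the normalized prediction matches the unique answer."""
    return normalize(prediction) == normalize(ground_truth)

def accuracy(predictions, ground_truths):
    """Fraction of questions answered exactly correctly."""
    assert len(predictions) == len(ground_truths)
    hits = sum(score(p, g) for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```

A unique-answer design like this is what makes GAIA scoring simple and robust: there is no need for a judge model or fuzzy semantic matching.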

GAIA's approach differs from traditional AI benchmarks by focusing on tasks that are deceptively simple for humans but intricate for AI, aiming to push AI towards the next generation of capabilities[1].

Citations: