GAIA: a benchmark for General AI Assistants
Authors: Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. |
| Researcher Affiliation | Industry | Meta AI; Hugging Face; AutoGPT. {gmialon, tscialom}@meta.com, clementine@huggingface.co |
| Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper states that it releases the benchmark's questions and answers, but it does not provide source code for its evaluation methodology or model implementations. |
| Open Datasets | Yes | We release a developer set of 166 annotated questions and release the remaining 300 questions without annotations: the benchmark will be notably hosted as a leaderboard. |
| Dataset Splits | Yes | We release a developer set of 166 annotated questions and release the remaining 300 questions without annotations: the benchmark will be notably hosted as a leaderboard. (A dataset-loading sketch follows the table.) |
| Hardware Specification | No | The paper evaluates external LLMs (GPT-4, AutoGPT) and does not specify the hardware used for its evaluation experiments. |
| Software Dependencies | No | The paper mentions evaluating specific LLMs such as 'GPT-4 (OpenAI, 2023)' and 'AutoGPT' but does not list specific software dependencies with version numbers for its experimental setup beyond these external tools. |
| Experiment Setup | Yes | We use a prefix prompt before asking the model a question. To ease answer extraction, we specify a format in the prefix prompt (see Figure 2). (A prompt-and-scoring sketch follows the table.) |
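For the Open Datasets and Dataset Splits rows above, here is a minimal loading sketch, assuming the benchmark is distributed through the Hugging Face Hub under a `gaia-benchmark/GAIA` repository. The repository id, config name, split names, and field names are assumptions rather than details quoted from the paper, and the dataset is gated on the Hub, so access may additionally require accepting its terms and authenticating.

```python
# Minimal sketch: load GAIA's public developer set.
# Assumes a Hub repository "gaia-benchmark/GAIA" with a "2023_all" config
# whose "validation" split holds the 166 annotated questions and whose
# "test" split holds the 300 unannotated ones. All identifiers below
# (repo id, config, split, field names) are assumptions.
from datasets import load_dataset

dev = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")
print(len(dev))                # expected: 166 annotated questions
print(dev[0]["Question"])      # question text (field name assumed)
print(dev[0]["Final answer"])  # gold annotation (field name assumed)
```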
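For the Experiment Setup row, here is a minimal sketch of the extraction-and-scoring step that the prefix prompt enables. The paper's prefix prompt (Figure 2) asks models to finish with a `FINAL ANSWER:` template; the regex and normalization rules below are simplified assumptions, not GAIA's official scorer.

```python
import re

# Minimal sketch of answer extraction and quasi-exact-match scoring,
# assuming models follow the paper's "FINAL ANSWER: <answer>" template.
# The normalization rules here are simplified assumptions.

def extract_final_answer(model_output: str) -> str:
    """Return the text after the last 'FINAL ANSWER:' marker, if any."""
    matches = re.findall(r"FINAL ANSWER:\s*(.+)", model_output, re.IGNORECASE)
    return matches[-1].strip() if matches else ""

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods."""
    return answer.strip().rstrip(".").lower()

def quasi_exact_match(prediction: str, gold: str) -> bool:
    return normalize(extract_final_answer(prediction)) == normalize(gold)

# Usage: a prediction in the expected format scores True.
output = "The tower was completed in 1889.\nFINAL ANSWER: 1889"
print(quasi_exact_match(output, "1889"))  # True
```

Taking the last occurrence of the marker guards against models that restate the template while reasoning before committing to an answer.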