Large Language Models Are Not Strong Abstract Reasoners
Authors: Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. |
| Researcher Affiliation | Academia | Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie, University of Auckland, gael.gendron@auckland.ac.nz |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | We release our code and data at: https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning |
| Open Datasets | Yes | We build a text-based version of the Abstract Causal Reasoning (ACRE) dataset [Zhang et al., 2021a] that we name ACRET. The second dataset is the Abstract Reasoning Challenge (ARC) dataset [Chollet, 2019]. We select a subset of the BIG-Bench dataset [Rule, 2020; Srivastava et al., 2022] that we name BIG-Bench-F for Functions. We select a subset of the Evals dataset [OpenAI, 2023] representing logic puzzles. Pointer-Value Retrieval (PVR) tasks [Zhang et al., 2021b] involve selecting one or several values in a list and applying a function on this subset (a minimal PVR sketch follows this table). RAVEN [Zhang et al., 2019] is a VQA dataset composed of sequences of images to complete. |
| Dataset Splits | No | The paper discusses 'training and test sets' and 'out-of-distribution (o.o.d) splits', and mentions fine-tuning LLaMA2 on 'training sets'. However, it does not provide specific percentages, sample counts, or clear methodologies for generating training, validation, and test splits needed for reproduction beyond mentioning the existence of these sets. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory, or cloud instance specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions that the code interpreter is in Python ('delegated to a code interpreter in Python') but does not specify any software dependencies with version numbers (e.g., Python version, specific libraries like PyTorch, TensorFlow, or HuggingFace Transformers versions). |
| Experiment Setup | Yes | For all models, we evaluate the 7B parameters versions by default. We perform experiments on a subset of our framework using Chain-of-Thought prompting [Wei et al., 2022]. By default, we give four examples to the model before asking it to answer. |
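The Pointer-Value Retrieval (PVR) tasks cited in the Open Datasets row are described only in one sentence, so the sketch below illustrates the idea under an assumption: in the simplest PVR variant from Zhang et al. [2021b], the first element of the sequence acts as a pointer into the remaining values and the target is the pointed-to value. The `make_pvr_example` helper and the text prompt format are illustrative and are not taken from the authors' released code.

```python
# Hypothetical sketch of a text-based Pointer-Value Retrieval (PVR) instance.
# Assumption: the simplest PVR variant, where the first element is a pointer
# into the remaining values and the label is the value at that position.
import random


def make_pvr_example(num_values: int = 8) -> tuple[str, int]:
    """Generate one PVR example: [pointer, v_0, ..., v_{n-1}] -> v_pointer."""
    values = [random.randint(0, 9) for _ in range(num_values)]
    pointer = random.randint(0, num_values - 1)
    sequence = [pointer] + values   # first element points into the rest of the list
    label = values[pointer]         # the target is the pointed-to value
    prompt = f"Input: {sequence}\nOutput:"
    return prompt, label


if __name__ == "__main__":
    prompt, label = make_pvr_example()
    print(prompt, label)
```

More complex PVR variants apply a function (e.g., a sum or a permutation) to a window of values around the pointer rather than retrieving a single element.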
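The Experiment Setup row states that each model receives four examples before being asked to answer, optionally with Chain-of-Thought prompting [Wei et al., 2022]. Below is a minimal sketch of such a prompt builder, assuming a generic `Input:`/`Output:` template and the common "Let's think step by step." trigger; the function name and template are assumptions, not the authors' actual evaluation harness.

```python
# A minimal sketch of four-shot prompt construction with an optional
# Chain-of-Thought trigger. Template and helper names are assumptions.
from typing import Sequence, Tuple


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, str]],
    query: str,
    num_shots: int = 4,
    chain_of_thought: bool = False,
) -> str:
    """Concatenate `num_shots` solved examples followed by the unsolved query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples[:num_shots]]
    trigger = "Let's think step by step." if chain_of_thought else ""
    blocks.append(f"Input: {query}\nOutput: {trigger}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [("[1, 2, 3] -> ?", "[2, 4, 6]"),
             ("[0, 5, 1] -> ?", "[0, 10, 2]"),
             ("[7, 7] -> ?", "[14, 14]"),
             ("[3] -> ?", "[6]")]
    print(build_few_shot_prompt(demos, "[4, 1, 9] -> ?", chain_of_thought=True))
```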