Large Language Models Are Not Strong Abstract Reasoners

Authors: Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks.
Researcher Affiliation | Academia | Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie; University of Auckland; gael.gendron@auckland.ac.nz
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | We release our code and data at: https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
Open Datasets | Yes | We build a text-based version of the Abstract Causal Reasoning (ACRE) dataset [Zhang et al., 2021a] that we name ACRE-T. The second dataset is the Abstract Reasoning Challenge (ARC) dataset [Chollet, 2019]. We select a subset of the BIG-Bench dataset [Rule, 2020; Srivastava et al., 2022] that we name BIG-Bench-F for Functions. We select a subset of the Evals dataset [OpenAI, 2023] representing logic puzzles. Pointer-Value Retrieval (PVR) tasks [Zhang et al., 2021b] involve selecting one or several values in a list and applying a function on this subset (see the PVR-style sketch after this table). RAVEN [Zhang et al., 2019] is a VQA dataset composed of sequences of images to complete.
Dataset Splits | No | The paper discusses 'training and test sets' and 'out-of-distribution (o.o.d.) splits', and mentions fine-tuning LLaMA2 on 'training sets'. However, it does not provide the specific percentages, sample counts, or split-generation methodology needed to reproduce the training, validation, and test splits; it only notes that such sets exist.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory, or cloud instance specifications) used to run the experiments.
Software Dependencies | No | The paper mentions that the code interpreter is in Python ('delegated to a code interpreter in Python') but does not specify any software dependencies with version numbers (e.g., Python version, specific libraries like PyTorch, TensorFlow, or HuggingFace Transformers versions).
Experiment Setup | Yes | For all models, we evaluate the 7B-parameter versions by default. We perform experiments on a subset of our framework using Chain-of-Thought prompting [Wei et al., 2022]. By default, we give four examples to the model before asking it to answer (see the few-shot prompt sketch after this table).
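
The Open Datasets row describes Pointer-Value Retrieval (PVR) tasks as selecting one or several values in a list and applying a function to that subset. The minimal sketch below generates one such item; the generator name `make_pvr_example`, the windowed-sum aggregation, and the prompt wording are illustrative assumptions, not the paper's actual data pipeline.

```python
import random


def make_pvr_example(list_len: int = 8, window: int = 3) -> tuple[str, int]:
    """Generate one Pointer-Value Retrieval (PVR) style item.

    The first element of the sequence acts as a pointer into the remaining
    values; the label is an aggregate (here, a sum) of the `window` values
    starting at that pointer. Hypothetical illustration only.
    """
    values = [random.randint(0, 9) for _ in range(list_len)]
    pointer = random.randint(0, list_len - window)  # keep the window in bounds
    sequence = [pointer] + values
    label = sum(values[pointer:pointer + window])
    prompt = f"Input: {sequence}\nOutput:"
    return prompt, label


if __name__ == "__main__":
    prompt, label = make_pvr_example()
    print(prompt, label)
```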
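
The Experiment Setup row states that, by default, four in-context examples are given before the model is asked to answer, and that Chain-of-Thought prompting is evaluated on a subset of the framework. The sketch below shows how such a k-shot prompt could be assembled; the function name `build_few_shot_prompt`, the instruction wording, and the toy task are assumptions for illustration, not the authors' exact template.

```python
def build_few_shot_prompt(examples, query, chain_of_thought=False):
    """Assemble a k-shot prompt (k = 4 in the paper's default setup).

    `examples` is a list of (input, output) pairs drawn from a task's
    training split; `query` is the held-out input the model must solve.
    """
    parts = []
    if chain_of_thought:
        # Hypothetical CoT instruction; the paper's exact wording may differ.
        parts.append("Think step by step, then give the final answer.")
    for x, y in examples:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Toy sequence-completion shots (illustrative only, not from the paper's datasets).
shots = [("1 2 3 4", "5"), ("2 4 6 8", "10"),
         ("5 10 15 20", "25"), ("3 6 9 12", "15")]
print(build_few_shot_prompt(shots, "7 14 21 28", chain_of_thought=True))
```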