Large Language Models Are Not Strong Abstract Reasoners

Authors: Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks.
Researcher Affiliation | Academia | Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie; University of Auckland; gael.gendron@auckland.ac.nz
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | We release our code and data at: https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
Open Datasets | Yes | We build a text-based version of the Abstract Causal Reasoning (ACRE) dataset [Zhang et al., 2021a] that we name ACRE-T. The second dataset is the Abstract Reasoning Challenge (ARC) dataset [Chollet, 2019]. We select a subset of the BIG-Bench dataset [Rule, 2020; Srivastava et al., 2022] that we name BIG-Bench-F for Functions. We select a subset of the Evals dataset [OpenAI, 2023] representing logic puzzles. Pointer-Value Retrieval (PVR) tasks [Zhang et al., 2021b] involve selecting one or several values in a list and applying a function on this subset (see the PVR-style sketch after this table). RAVEN [Zhang et al., 2019] is a VQA dataset composed of sequences of images to complete.
Dataset Splits | No | The paper discusses 'training and test sets' and 'out-of-distribution (o.o.d.) splits', and mentions fine-tuning LLaMA2 on 'training sets'. However, it does not provide the specific percentages, sample counts, or split-generation methodology needed to reproduce the training, validation, and test splits; it only notes that such sets exist.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory, or cloud instance specifications) used to run the experiments.
Software Dependencies | No | The paper mentions that the code interpreter is in Python ('delegated to a code interpreter in Python') but does not specify any software dependencies with version numbers (e.g., Python version, specific libraries like PyTorch, TensorFlow, or HuggingFace Transformers versions).
Experiment Setup | Yes | For all models, we evaluate the 7B-parameter versions by default. We perform experiments on a subset of our framework using Chain-of-Thought prompting [Wei et al., 2022]. By default, we give four examples to the model before asking it to answer (see the few-shot prompt sketch after this table).
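
The Open Datasets row describes Pointer-Value Retrieval (PVR) tasks as selecting one or several values in a list and applying a function to that subset. The minimal sketch below generates one such item; the generator name `make_pvr_example`, the windowed-sum aggregation, and the prompt wording are illustrative assumptions, not the paper's actual data pipeline.

```python
import random


def make_pvr_example(list_len: int = 8, window: int = 3) -> tuple[str, int]:
    """Generate one Pointer-Value Retrieval (PVR) style item.

    The first element of the sequence acts as a pointer into the remaining
    values; the label is an aggregate (here, a sum) of the `window` values
    starting at that pointer. Hypothetical illustration only.
    """
    values = [random.randint(0, 9) for _ in range(list_len)]
    pointer = random.randint(0, list_len - window)  # keep the window in bounds
    sequence = [pointer] + values
    label = sum(values[pointer:pointer + window])
    prompt = f"Input: {sequence}\nOutput:"
    return prompt, label


if __name__ == "__main__":
    prompt, label = make_pvr_example()
    print(prompt, label)
```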
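
The Experiment Setup row states that, by default, four in-context examples are given before the model is asked to answer, and that Chain-of-Thought prompting is evaluated on a subset of the framework. The sketch below shows how such a k-shot prompt could be assembled; the function name `build_few_shot_prompt`, the instruction wording, and the toy task are assumptions for illustration, not the authors' exact template.

```python
def build_few_shot_prompt(examples, query, chain_of_thought=False):
    """Assemble a k-shot prompt (k = 4 in the paper's default setup).

    `examples` is a list of (input, output) pairs drawn from a task's
    training split; `query` is the held-out input the model must solve.
    """
    parts = []
    if chain_of_thought:
        # Hypothetical CoT instruction; the paper's exact wording may differ.
        parts.append("Think step by step, then give the final answer.")
    for x, y in examples:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Toy sequence-completion shots (illustrative only, not from the paper's datasets).
shots = [("1 2 3 4", "5"), ("2 4 6 8", "10"),
         ("5 10 15 20", "25"), ("3 6 9 12", "15")]
print(build_few_shot_prompt(shots, "7 14 21 28", chain_of_thought=True))
```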