Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
Authors: Antonia Creswell, Murray Shanahan, Irina Higgins
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we carry out a comprehensive evaluation of LLMs on 46 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. Focusing on a subset of 10 reasoning tasks from ProofWriter and bAbI, we show that a 7B parameter, decoder-only LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent Vanilla baseline. |
| Researcher Affiliation | Industry | Antonia Creswell, Murray Shanahan & Irina Higgins DeepMind, London, UK {tonicreswell, mshanahan, irinah}@deepmind.com |
| Pseudocode | Yes | Algorithm 1 Selection-Inference |
| Open Source Code | No | No explicit statement about the release of the code for the methodology or a link to a code repository was found. |
| Open Datasets | Yes | The additional tasks were collected from six sources: bAbI (Weston et al., 2016), BigBench (Ghazal et al., 2017), AAC (Betz et al., 2021), Jeopardy (Tunguz, 2019), ProofWriter (Tafjord et al., 2021) and 2WikiMultiHop (Ho et al., 2020) (see Fig. A5a for raw results). |
| Dataset Splits | No | No explicit mention of specific training, validation, or test dataset splits (percentages or counts) or a detailed splitting methodology was found. |
| Hardware Specification | Yes | The Selection LLM was trained for 4×10⁴ steps (with batch size 16 for 50 hours on a TPU) with the exact string match accuracy reported in Fig. 6a. |
| Software Dependencies | No | The paper mentions software components like 'LLMs' and 'statsmodels.stats.proportion', but does not provide specific version numbers for any key software dependencies used in their experiments. |
| Experiment Setup | Yes | We evaluated decoder-only LLMs of various sizes in a 5-shot setting, following the same protocol used for the Big Bench evaluation in Rae et al. (2021), on a larger set of 46 tasks. The Selection LLM was trained for 4×10⁴ steps (with batch size 16 for 50 hours on a TPU) with the exact string match accuracy reported in Fig. 6a. |
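The pseudocode row above names Algorithm 1 (Selection-Inference), which alternates a selection step (pick relevant facts from the context) with an inference step (derive one new fact from them) until an answer is reached. The following is a minimal runnable sketch of that control flow only: the two LLM roles are replaced by toy rule-based stubs over "X implies Y" facts, so `select`, `infer`, and `selection_inference` are illustrative names, not the authors' implementation.

```python
def select(context, known):
    # Selection step: choose a rule whose premise matches what we
    # currently know. In the paper this is a frozen LLM prompted 5-shot.
    for fact in context:
        premise, _ = fact.split(" implies ")
        if premise == known:
            return fact
    return None

def infer(rule, known):
    # Inference step: apply modus ponens to the selected rule.
    premise, conclusion = rule.split(" implies ")
    return conclusion if premise == known else None

def selection_inference(context, start, goal, max_steps=5):
    # Iterate select -> infer, feeding each derived fact back in,
    # until the goal is reached or the step budget runs out.
    known = start
    for _ in range(max_steps):
        rule = select(context, known)
        if rule is None:
            return False
        known = infer(rule, known)
        if known == goal:
            return True
    return False

context = ["A implies B", "B implies C", "C implies D"]
print(selection_inference(context, "A", "D"))  # → True
```

Because each step records which facts were selected and what was inferred, the full chain forms a human-auditable reasoning trace, which is the interpretability claim of the SI framework.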
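The software-dependencies row notes that the paper uses `statsmodels.stats.proportion` without pinning a version. A typical use in this setting is a binomial confidence interval on task accuracy; the sketch below shows `proportion_confint` with illustrative counts (72 correct out of 100 is an assumed example, not a result from the paper).

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical example: 95% Wilson confidence interval for a model
# that answers 72 of 100 evaluation questions correctly.
low, high = proportion_confint(count=72, nobs=100, alpha=0.05, method="wilson")
print(round(low, 3), round(high, 3))
```

The `method` argument selects the interval type; `"wilson"` behaves better than the default `"normal"` approximation when accuracies are near 0 or 1 or sample sizes are small.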