Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
Authors: Antonia Creswell, Murray Shanahan, Irina Higgins
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we carry out a comprehensive evaluation of LLMs on 46 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. Focusing on a subset of 10 reasoning tasks from ProofWriter and bAbI, we show that a 7B parameter, decoder-only LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent Vanilla baseline. |
| Researcher Affiliation | Industry | Antonia Creswell, Murray Shanahan & Irina Higgins DeepMind, London, UK {tonicreswell, mshanahan, irinah}@deepmind.com |
| Pseudocode | Yes | Algorithm 1 Selection-Inference |
| Open Source Code | No | No explicit statement about the release of the code for the methodology or a link to a code repository was found. |
| Open Datasets | Yes | The additional tasks were collected from six sources: bAbI (Weston et al., 2016), BigBench (Ghazal et al., 2017), AAC (Betz et al., 2021), Jeopardy (Tunguz, 2019), ProofWriter (Tafjord et al., 2021) and 2WikiMultiHop (Ho et al., 2020) (see Fig. A5a for raw results). |
| Dataset Splits | No | No explicit mention of specific training, validation, or test dataset splits (percentages or counts) or a detailed splitting methodology was found. |
| Hardware Specification | Yes | The Selection LLM was trained for 4×10⁴ steps (with batch size 16 for 50 hours on a TPU) with the exact string match accuracy reported in Fig. 6a. |
| Software Dependencies | No | The paper mentions software components like 'LLMs' and 'statsmodels.stats.proportion', but does not provide specific version numbers for any key software dependencies used in their experiments. |
| Experiment Setup | Yes | We evaluated decoder-only LLMs of various sizes in a 5-shot setting, following the same protocol used for the Big Bench evaluation in Rae et al. (2021), on a larger set of 46 tasks. The Selection LLM was trained for 4×10⁴ steps (with batch size 16 for 50 hours on a TPU) with the exact string match accuracy reported in Fig. 6a. |
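The pseudocode row above names Algorithm 1 (Selection-Inference), which alternates a selection step (pick relevant facts from the context) with an inference step (derive one new fact from them) until an answer is reached. The following is a minimal runnable sketch of that control flow only: the two LLM roles are replaced by toy rule-based stubs over "X implies Y" facts, so `select`, `infer`, and `selection_inference` are illustrative names, not the authors' implementation.

```python
def select(context, known):
    # Selection step: choose a rule whose premise matches what we
    # currently know. In the paper this is a frozen LLM prompted 5-shot.
    for fact in context:
        premise, _ = fact.split(" implies ")
        if premise == known:
            return fact
    return None

def infer(rule, known):
    # Inference step: apply modus ponens to the selected rule.
    premise, conclusion = rule.split(" implies ")
    return conclusion if premise == known else None

def selection_inference(context, start, goal, max_steps=5):
    # Iterate select -> infer, feeding each derived fact back in,
    # until the goal is reached or the step budget runs out.
    known = start
    for _ in range(max_steps):
        rule = select(context, known)
        if rule is None:
            return False
        known = infer(rule, known)
        if known == goal:
            return True
    return False

context = ["A implies B", "B implies C", "C implies D"]
print(selection_inference(context, "A", "D"))  # → True
```

Because each step records which facts were selected and what was inferred, the full chain forms a human-auditable reasoning trace, which is the interpretability claim of the SI framework.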
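The software-dependencies row notes that the paper uses `statsmodels.stats.proportion` without pinning a version. A typical use in this setting is a binomial confidence interval on task accuracy; the sketch below shows `proportion_confint` with illustrative counts (72 correct out of 100 is an assumed example, not a result from the paper).

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical example: 95% Wilson confidence interval for a model
# that answers 72 of 100 evaluation questions correctly.
low, high = proportion_confint(count=72, nobs=100, alpha=0.05, method="wilson")
print(round(low, 3), round(high, 3))
```

The `method` argument selects the interval type; `"wilson"` behaves better than the default `"normal"` approximation when accuracies are near 0 or 1 or sample sizes are small.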