The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning

Authors: Xi Ye, Greg Durrett

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Does prompting a large language model (LLM) like GPT-3 with explanations improve in-context learning? We study this question on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. We test the performance of four LLMs on three textual reasoning datasets using prompts that include explanations in multiple different styles. For these tasks, we find that including explanations in the prompts for OPT, GPT-3 (davinci), and InstructGPT (text-davinci-001) only yields small to moderate accuracy improvements over standard few-shot learning. However, text-davinci-002 is able to benefit more substantially.
Researcher Affiliation | Academia | Xi Ye, Greg Durrett. Department of Computer Science, The University of Texas at Austin. {xiye,gdurrett}@cs.utexas.edu
Pseudocode | No | The paper describes the approach textually and with mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Data and code available at https://github.com/xiye17/TextualExplInContext
Open Datasets | Yes | Synthetic multi-hop QA (SYNTH): In order to have a controlled setting where we can easily understand whether explanations are factual and consistent with the answer, we create a synthetic multi-hop QA dataset... This dataset is inspired by task 15 of the bAbI dataset (Weston et al., 2016). In our preliminary experiments with some of the other bAbI tasks, we found poor performance from InstructGPT similar to our results on SYNTH, both with and without explanations. Adversarial HotpotQA (ADVHOTPOT): We also test on the English-language Adversarial HotpotQA dataset (Yang et al., 2018; Jiang and Bansal, 2019). E-SNLI: E-SNLI (Camburu et al., 2018) is an English-language classification dataset commonly used to study explanations, released under the MIT license.
Dataset Splits | No | For few-shot learning, we use roughly the maximum allowed shots in the prompt that can fit the length limit of OPT (Zhang et al., 2022) and GPT-3 (Brown et al., 2020), which is 16 for SYNTH, 6 for ADVHOTPOT, and 32 for E-SNLI, respectively. The paper also trains calibrators on additional examples (32 to 128 for E-SNLI), but explicit validation splits for the main experiments or the calibrator training are not provided.
Hardware Specification | No | The paper states 'We use the GPT-3 Instruct-series API (text-davinci-001),' but this does not specify the underlying hardware components like GPU models, CPU models, or memory details.
Software Dependencies | No | The paper mentions models like OPT, GPT-3, InstructGPT, text-davinci-002, RoBERTa, and DeBERTa, and uses 'greedy decoding (temperature set to be 0)', but it does not specify version numbers for any software libraries, programming languages, or environments.
Experiment Setup | Yes | We generate outputs with greedy decoding (temperature set to be 0). Our prompt formats follow those in Brown et al. (2020). For few-shot learning, we use roughly the maximum allowed shots in the prompt that can fit the length limit of OPT (Zhang et al., 2022) and GPT-3 (Brown et al., 2020), which is 16 for SYNTH, 6 for ADVHOTPOT, and 32 for E-SNLI, respectively. We use 5 groups for InstructGPT, the primary LM we are using throughout our paper, and 3 groups for the rest.
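
To make the reported setup concrete, here is a minimal sketch of few-shot prompting with explanation-augmented exemplars and greedy decoding (temperature 0). It assumes the legacy openai Python SDK (pre-1.0); the exemplar wording and the helper names build_prompt and answer are illustrative placeholders, not the authors' released code (see the repository linked above for that).

```python
# Minimal sketch: few-shot prompting with explanations, greedy decoding (temperature 0).
# Assumptions (not from the paper): legacy openai Python SDK (<1.0), placeholder exemplars.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Each exemplar pairs a question with an explanation and an answer.
EXEMPLARS = [
    {
        "question": "Mary went to the kitchen. John went to the garden. Where is Mary?",
        "explanation": "Mary's last recorded location is the kitchen.",
        "answer": "kitchen",
    },
    # ... more exemplars, up to the context-length limit (e.g. 16 shots for SYNTH).
]

def build_prompt(test_question: str) -> str:
    """Concatenate (question, explanation, answer) exemplars, then the test question."""
    parts = []
    for ex in EXEMPLARS:
        parts.append(
            f"Q: {ex['question']}\n"
            f"Explanation: {ex['explanation']}\n"
            f"A: {ex['answer']}\n"
        )
    parts.append(f"Q: {test_question}\nExplanation:")
    return "\n".join(parts)

def answer(test_question: str, model: str = "text-davinci-001") -> str:
    """Query the LM with temperature 0, i.e. greedy decoding as in the paper's setup."""
    response = openai.Completion.create(
        model=model,
        prompt=build_prompt(test_question),
        temperature=0,   # greedy decoding
        max_tokens=64,
    )
    return response["choices"][0]["text"].strip()

if __name__ == "__main__":
    print(answer("John moved to the office. Sandra went to the hallway. Where is John?"))
```

The model returns an explanation followed by an answer in one completion; parsing the two apart and checking whether the explanation is factual and consistent with the answer is the part of the paper's analysis not shown here.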