The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning

Authors: Xi Ye, Greg Durrett

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Does prompting a large language model (LLM) like GPT-3 with explanations improve in-context learning? We study this question on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. We test the performance of four LLMs on three textual reasoning datasets using prompts that include explanations in multiple different styles. For these tasks, we find that including explanations in the prompts for OPT, GPT-3 (davinci), and InstructGPT (text-davinci-001) only yields small to moderate accuracy improvements over standard few-shot learning. However, text-davinci-002 is able to benefit more substantially.
Researcher Affiliation | Academia | Xi Ye, Greg Durrett. Department of Computer Science, The University of Texas at Austin. {xiye,gdurrett}@cs.utexas.edu
Pseudocode | No | The paper describes the approach textually and with mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Data and code available at https://github.com/xiye17/TextualExplInContext
Open Datasets | Yes | Synthetic multi-hop QA (SYNTH): In order to have a controlled setting where we can easily understand whether explanations are factual and consistent with the answer, we create a synthetic multi-hop QA dataset... This dataset is inspired by task 15 of the bAbI dataset (Weston et al., 2016). In our preliminary experiments with some of the other bAbI tasks, we found poor performance from InstructGPT similar to our results on SYNTH, both with and without explanations. Adversarial HotpotQA (ADVHOTPOT): We also test on the English-language Adversarial HotpotQA dataset (Yang et al., 2018; Jiang and Bansal, 2019). E-SNLI: E-SNLI (Camburu et al., 2018) is an English-language classification dataset commonly used to study explanations, released under the MIT license.
Dataset Splits | No | For few-shot learning, we use roughly the maximum allowed shots in the prompt that can fit the length limit of OPT (Zhang et al., 2022) and GPT-3 (Brown et al., 2020), which is 16 for SYNTH, 6 for ADVHOTPOT, and 32 for E-SNLI, respectively. The paper also trains calibrators on additional examples (32 to 128 for E-SNLI), but explicit validation splits for the main experiments or the calibrator training are not provided.
Hardware Specification | No | The paper states 'We use the GPT-3 Instruct-series API (text-davinci-001),' but this does not specify the underlying hardware components like GPU models, CPU models, or memory details.
Software Dependencies | No | The paper mentions models like OPT, GPT-3, InstructGPT, text-davinci-002, RoBERTa, and DeBERTa, and uses 'greedy decoding (temperature set to be 0)', but it does not specify version numbers for any software libraries, programming languages, or environments.
Experiment Setup | Yes | We generate outputs with greedy decoding (temperature set to be 0). Our prompt formats follow those in Brown et al. (2020). For few-shot learning, we use roughly the maximum allowed shots in the prompt that can fit the length limit of OPT (Zhang et al., 2022) and GPT-3 (Brown et al., 2020), which is 16 for SYNTH, 6 for ADVHOTPOT, and 32 for E-SNLI, respectively. We use 5 groups for InstructGPT, the primary LM we are using throughout our paper, and 3 groups for the rest.
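
To make the reported setup concrete, here is a minimal sketch of few-shot prompting with explanation-augmented exemplars and greedy decoding (temperature 0). It assumes the legacy openai Python SDK (pre-1.0); the exemplar wording and the helper names build_prompt and answer are illustrative placeholders, not the authors' released code (see the repository linked above for that).

```python
# Minimal sketch: few-shot prompting with explanations, greedy decoding (temperature 0).
# Assumptions (not from the paper): legacy openai Python SDK (<1.0), placeholder exemplars.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Each exemplar pairs a question with an explanation and an answer.
EXEMPLARS = [
    {
        "question": "Mary went to the kitchen. John went to the garden. Where is Mary?",
        "explanation": "Mary's last recorded location is the kitchen.",
        "answer": "kitchen",
    },
    # ... more exemplars, up to the context-length limit (e.g. 16 shots for SYNTH).
]

def build_prompt(test_question: str) -> str:
    """Concatenate (question, explanation, answer) exemplars, then the test question."""
    parts = []
    for ex in EXEMPLARS:
        parts.append(
            f"Q: {ex['question']}\n"
            f"Explanation: {ex['explanation']}\n"
            f"A: {ex['answer']}\n"
        )
    parts.append(f"Q: {test_question}\nExplanation:")
    return "\n".join(parts)

def answer(test_question: str, model: str = "text-davinci-001") -> str:
    """Query the LM with temperature 0, i.e. greedy decoding as in the paper's setup."""
    response = openai.Completion.create(
        model=model,
        prompt=build_prompt(test_question),
        temperature=0,   # greedy decoding
        max_tokens=64,
    )
    return response["choices"][0]["text"].strip()

if __name__ == "__main__":
    print(answer("John moved to the office. Sandra went to the hallway. Where is John?"))
```

The model returns an explanation followed by an answer in one completion; parsing the two apart and checking whether the explanation is factual and consistent with the answer is the part of the paper's analysis not shown here.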