Large Language Models Can Be Easily Distracted by Irrelevant Context
Authors: Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, Denny Zhou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate the distractibility of large language models, i.e., how the model problem-solving accuracy can be influenced by irrelevant context. In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. We use this benchmark to measure the distractibility of cutting-edge prompting techniques for large language models, and find that the model performance is dramatically decreased when irrelevant information is included. We also identify several approaches for mitigating this deficiency, such as decoding with self-consistency and adding to the prompt an instruction that tells the language model to ignore the irrelevant information. (See the instructed-prompting sketch below the table.) |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2Toyota Technological Institute at Chicago, 3Purdue University. Correspondence to: Freda Shi <freda@ttic.edu>, Xinyun Chen <xinyunchen@google.com>, Denny Zhou <dennyzhou@google.com>. |
| Pseudocode | No | The information is insufficient. The paper does not contain any structured pseudocode or algorithm blocks for its methods. |
| Open Source Code | No | The information is insufficient. The paper states that the 'Dataset is available at https://github.com/google-research-datasets/GSM-IC' but does not release source code for the prompting methods it evaluates. |
| Open Datasets | Yes | In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. ... Dataset is available at https://github.com/google-research-datasets/GSM-IC. (See the dataset-loading sketch below the table.) |
| Dataset Splits | No | The information is insufficient. The paper mentions using a 'development set' from GSM8K for constructing their dataset, but does not provide specific details about training, validation, and test splits for their main GSM-IC experiments that would be needed for reproduction. |
| Hardware Specification | No | The information is insufficient. The paper mentions using 'Codex (code-davinci-002) and GPT-3.5 (text-davinci-003)' for evaluation, which are large language models, but does not provide specific hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The information is insufficient. The paper mentions executing Python code using 'an external Python interpreter' but does not provide specific software dependencies with version numbers (e.g., Python version, library versions) for reproducibility. (See the interpreter-execution sketch below the table.) |
| Experiment Setup | Yes | For experiments without self-consistency decoding, we use greedy decoding (i.e., temperature τ = 0); for self-consistency experiments that require multiple samples for a problem, we sample 20 responses with temperature τ = 0.7 following Wang et al. (2022c). (See the self-consistency sketch below the table.) |
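
To make the dataset row concrete, here is a minimal sketch of fetching GSM-IC from the linked repository. The raw file name (`GSM-IC_2step.json`) is an assumption about the repository layout, not a detail confirmed by the paper, and the JSON schema is not specified either; inspect what comes back before depending on it.

```python
# Minimal sketch of pulling GSM-IC from its public repository.
# ASSUMPTION: the raw file path below ("GSM-IC_2step.json") and the JSON
# layout are illustrative; check the repository before relying on them.
import json
import urllib.request

RAW_URL = ("https://raw.githubusercontent.com/google-research-datasets/"
           "GSM-IC/main/GSM-IC_2step.json")  # assumed file name

with urllib.request.urlopen(RAW_URL) as resp:
    problems = json.load(resp)

print(type(problems), len(problems))  # inspect the actual schema first
```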
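The instructed-prompting mitigation quoted in the Research Type row amounts to prepending an instruction to the few-shot prompt. A minimal sketch, where the instruction string is a paraphrase (the paper's exact wording may differ) and `build_prompt` is a hypothetical helper:

```python
# Sketch of the "instructed prompting" mitigation: prepend an instruction
# telling the model to ignore irrelevant information. The wording below is
# a paraphrase, not necessarily the paper's exact prompt text.

IGNORE_INSTRUCTION = (
    "Solve grade school math problems. Feel free to ignore irrelevant "
    "information given in the questions."
)

def build_prompt(exemplars: str, question: str) -> str:
    """Compose an instructed chain-of-thought prompt for one problem."""
    return f"{IGNORE_INSTRUCTION}\n\n{exemplars}\nQ: {question}\nA:"
```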
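For the program-style prompts, the paper executes model-generated Python with 'an external Python interpreter' but names no specific tooling. A minimal sketch, assuming a plain subprocess call to a fresh interpreter; `run_generated_program` is a hypothetical helper, and a subprocess is isolation-by-convention, not a real sandbox:

```python
# Sketch of running a model-generated Python program in an external
# interpreter. The subprocess keeps generated code out of the harness
# process, and `timeout_s` cuts off non-terminating generations (by
# raising subprocess.TimeoutExpired).
import subprocess
import sys

def run_generated_program(code: str, timeout_s: int = 10) -> str:
    """Execute generated Python in a fresh interpreter; return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout.strip()

# Example: a generation that prints its final answer.
print(run_generated_program("print(3 * (4 + 5))"))  # -> 27
```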
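Finally, the self-consistency setup from the Experiment Setup row (20 samples at τ = 0.7, majority vote over final answers, per Wang et al., 2022) can be sketched as follows. `sample_model` is a hypothetical stand-in for whatever LLM API is in use, and the answer-extraction regex is likewise an assumption:

```python
# Minimal sketch of self-consistency decoding: sample several reasoning
# paths at temperature 0.7 and majority-vote the final answers.
# `sample_model` is a hypothetical stand-in for an LLM API call.
import re
from collections import Counter
from typing import Callable, Optional

def extract_answer(completion: str) -> Optional[str]:
    """Take the last number in a chain-of-thought completion as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def self_consistency(prompt: str,
                     sample_model: Callable[[str, float], str],
                     n_samples: int = 20,
                     temperature: float = 0.7) -> Optional[str]:
    """Return the majority-vote answer over sampled reasoning paths."""
    votes = Counter()
    for _ in range(n_samples):
        answer = extract_answer(sample_model(prompt, temperature))
        if answer is not None:
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None
```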