Large Language Models Can Be Easily Distracted by Irrelevant Context
Authors: Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, Denny Zhou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate the distractibility of large language models, i.e., how the model problem-solving accuracy can be influenced by irrelevant context. In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. We use this benchmark to measure the distractibility of cutting-edge prompting techniques for large language models, and find that the model performance is dramatically decreased when irrelevant information is included. We also identify several approaches for mitigating this deficiency, such as decoding with self-consistency and adding to the prompt an instruction that tells the language model to ignore the irrelevant information. (See the instructed-prompting sketch below the table.) |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2Toyota Technological Institute at Chicago, 3Purdue University. Correspondence to: Freda Shi <freda@ttic.edu>, Xinyun Chen <xinyunchen@google.com>, Denny Zhou <dennyzhou@google.com>. |
| Pseudocode | No | The information is insufficient. The paper does not contain any structured pseudocode or algorithm blocks for its methods. |
| Open Source Code | No | The information is insufficient. The paper states that the 'Dataset is available at https://github.com/google-research-datasets/GSM-IC' but does not release source code for the prompting methods it evaluates. |
| Open Datasets | Yes | In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. ... Dataset is available at https://github.com/google-research-datasets/GSM-IC. (See the dataset-loading sketch below the table.) |
| Dataset Splits | No | The information is insufficient. The paper mentions using a 'development set' from GSM8K for constructing their dataset, but does not provide specific details about training, validation, and test splits for their main GSM-IC experiments that would be needed for reproduction. |
| Hardware Specification | No | The information is insufficient. The paper mentions using 'Codex (code-davinci-002) and GPT-3.5 (text-davinci-003)' for evaluation, which are large language models, but does not provide specific hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The information is insufficient. The paper mentions executing Python code using 'an external Python interpreter' but does not provide specific software dependencies with version numbers (e.g., Python version, library versions) for reproducibility. (See the interpreter-execution sketch below the table.) |
| Experiment Setup | Yes | For experiments without self-consistency decoding, we use greedy decoding (i.e., temperature τ = 0); for self-consistency experiments that require multiple samples for a problem, we sample 20 responses with temperature τ = 0.7 following Wang et al. (2022c). (See the self-consistency sketch below the table.) |
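
To make the dataset row concrete, here is a minimal sketch of fetching GSM-IC from the linked repository. The raw file name (`GSM-IC_2step.json`) is an assumption about the repository layout, not a detail confirmed by the paper, and the JSON schema is not specified either; inspect what comes back before depending on it.

```python
# Minimal sketch of pulling GSM-IC from its public repository.
# ASSUMPTION: the raw file path below ("GSM-IC_2step.json") and the JSON
# layout are illustrative; check the repository before relying on them.
import json
import urllib.request

RAW_URL = ("https://raw.githubusercontent.com/google-research-datasets/"
           "GSM-IC/main/GSM-IC_2step.json")  # assumed file name

with urllib.request.urlopen(RAW_URL) as resp:
    problems = json.load(resp)

print(type(problems), len(problems))  # inspect the actual schema first
```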
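The instructed-prompting mitigation quoted in the Research Type row amounts to prepending an instruction to the few-shot prompt. A minimal sketch, where the instruction string is a paraphrase (the paper's exact wording may differ) and `build_prompt` is a hypothetical helper:

```python
# Sketch of the "instructed prompting" mitigation: prepend an instruction
# telling the model to ignore irrelevant information. The wording below is
# a paraphrase, not necessarily the paper's exact prompt text.

IGNORE_INSTRUCTION = (
    "Solve grade school math problems. Feel free to ignore irrelevant "
    "information given in the questions."
)

def build_prompt(exemplars: str, question: str) -> str:
    """Compose an instructed chain-of-thought prompt for one problem."""
    return f"{IGNORE_INSTRUCTION}\n\n{exemplars}\nQ: {question}\nA:"
```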
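For the program-style prompts, the paper executes model-generated Python with 'an external Python interpreter' but names no specific tooling. A minimal sketch, assuming a plain subprocess call to a fresh interpreter; `run_generated_program` is a hypothetical helper, and a subprocess is isolation-by-convention, not a real sandbox:

```python
# Sketch of running a model-generated Python program in an external
# interpreter. The subprocess keeps generated code out of the harness
# process, and `timeout_s` cuts off non-terminating generations (by
# raising subprocess.TimeoutExpired).
import subprocess
import sys

def run_generated_program(code: str, timeout_s: int = 10) -> str:
    """Execute generated Python in a fresh interpreter; return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout.strip()

# Example: a generation that prints its final answer.
print(run_generated_program("print(3 * (4 + 5))"))  # -> 27
```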
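Finally, the self-consistency setup from the Experiment Setup row (20 samples at τ = 0.7, majority vote over final answers, per Wang et al., 2022) can be sketched as follows. `sample_model` is a hypothetical stand-in for whatever LLM API is in use, and the answer-extraction regex is likewise an assumption:

```python
# Minimal sketch of self-consistency decoding: sample several reasoning
# paths at temperature 0.7 and majority-vote the final answers.
# `sample_model` is a hypothetical stand-in for an LLM API call.
import re
from collections import Counter
from typing import Callable, Optional

def extract_answer(completion: str) -> Optional[str]:
    """Take the last number in a chain-of-thought completion as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def self_consistency(prompt: str,
                     sample_model: Callable[[str, float], str],
                     n_samples: int = 20,
                     temperature: float = 0.7) -> Optional[str]:
    """Return the majority-vote answer over sampled reasoning paths."""
    votes = Counter()
    for _ in range(n_samples):
        answer = extract_answer(sample_model(prompt, temperature))
        if answer is not None:
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None
```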