Making Retrieval-Augmented Language Models Robust to Irrelevant Context
Authors: Ori Yoran, Tomer Wolfson, Ori Ram, Jonathan Berant
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present a thorough analysis on five open-domain question answering benchmarks, characterizing cases when retrieval reduces accuracy. We then propose two methods to mitigate this issue. First, a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI) model. This is effective in preventing performance reduction, but at a cost of also discarding relevant passages. Thus, we propose a method for automatically generating data to fine-tune the language model to properly leverage retrieved passages, including for challenging multi-hop tasks, using a mix of relevant and irrelevant contexts at training time. We empirically show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones. (A hedged sketch of the NLI-filtering step appears after this table.) |
| Researcher Affiliation | Collaboration | Ori Yoran (1), Tomer Wolfson (1,2), Ori Ram (1), Jonathan Berant (1); (1) Tel Aviv University, (2) Allen Institute for AI. {ori.yoran, ori.ram, joberant}@cs.tau.ac.il, tomerw@allenai.org |
| Pseudocode | No | The paper describes its methods in prose and uses diagrams (e.g., Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | 1Our code, data, and models are available at https://github.com/oriyor/ret-robust. |
| Open Datasets | Yes | We experiment with both single and multi-hop QA datasets. We list and give an example from each dataset in Tab. 1. Our QA benchmarks can be categorized based on their required reasoning skills: Single-hop: Information-seeking questions that do not require decomposition. We use the popular Natural Questions (NQ) dataset (Kwiatkowski et al., 2019). Explicit Reasoning: Multi-hop questions where reasoning is explicitly expressed in the question. We include 2WIKIMQA (Welbl et al., 2018) and BAMBOOGLE (Press et al., 2023). Implicit Reasoning: Multi-hop questions where generating reasoning steps requires commonsense (implicit reasoning, Geva et al. (2021)). Such questions may have multiple valid reasoning chains. We evaluate on STRATEGYQA (Geva et al., 2021) and FERMI (Kalyan et al., 2021). |
| Dataset Splits | No | The paper mentions the number of training examples used for finetuning (e.g., "NQ, 1000 training examples") and states they evaluate on "500 random examples from the development set of each dataset," but it does not provide explicit overall training, validation, and test dataset splits (e.g., percentages or exact counts for the full dataset splits) needed to reproduce the data partitioning from scratch. |
| Hardware Specification | No | The paper states that models were trained "on a single GPU," but it does not specify the exact model of the GPU (e.g., NVIDIA A100, RTX 2080 Ti), CPU, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific models (e.g., Llama-2-13B) and a technique (QLoRA), but it does not explicitly provide specific version numbers for key software components or libraries (e.g., Python, PyTorch, TensorFlow, or other packages) that are required to replicate the experiment. |
| Experiment Setup | Yes | Training hyperparameters are in A.1. We fine-tune all our models with QLoRA (Dettmers et al., 2023) for parameter-efficient fine-tuning. We use the default hyperparameters from https://github.com/daniel-furman/sft-demos/blob/main/src/sft/one_gpu/llama-2/guanaco/sft-llama-2-13b-guanaco-peft.ipynb. We train all our models for 5 epochs, with a learning rate of 2e-4 and linear scheduling on a single GPU. (A hedged training-config sketch based on these values follows the table.) |
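
The NLI-filtering baseline quoted under Research Type is simple enough to sketch. The example below is a minimal illustration, assuming an off-the-shelf MNLI model from the Hugging Face hub (facebook/bart-large-mnli) and an illustrative entailment threshold; the paper's exact NLI model, hypothesis template, and decision rule are not specified in this table and may differ.

```python
# Hedged sketch of NLI-based passage filtering: keep a retrieved passage only if
# an NLI model predicts that it entails the question-answer pair.
# The model name, hypothesis template, and threshold are illustrative assumptions.
from transformers import pipeline

nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def keep_passage(passage: str, question: str, answer: str, threshold: float = 0.5) -> bool:
    """Return True if the NLI model says the passage entails the QA pair."""
    # Premise: the retrieved passage. Hypothesis: the QA pair stated declaratively.
    hypothesis = f"The answer to the question '{question}' is '{answer}'."
    scores = nli({"text": passage, "text_pair": hypothesis}, top_k=None)
    entailment = next(s["score"] for s in scores if s["label"].lower().startswith("entail"))
    return entailment >= threshold

# Usage: drop passages the NLI model judges as irrelevant before prompting the LM.
passages = [
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower was completed in 1889.",
]
kept = [p for p in passages if keep_passage(p, "What is the capital of France?", "Paris")]
print(kept)
```

As the quoted abstract notes, a filter of this kind prevents accuracy drops from irrelevant context but can also discard relevant passages, which is what motivates the paper's second, fine-tuning-based method.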
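
The Experiment Setup row confirms only a handful of hyperparameters (QLoRA, 5 epochs, learning rate 2e-4, linear schedule, single GPU) and points to the sft-demos notebook for the rest. The sketch below shows one way those values could be expressed with standard transformers/peft config objects; the LoRA rank, target modules, and batch-size settings are assumptions borrowed from common Llama-2 QLoRA recipes, not values reported in the paper.

```python
# Hedged sketch of the reported QLoRA fine-tuning setup (5 epochs, lr 2e-4,
# linear schedule, single GPU). Adapter sizes and batch settings are assumptions.
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=64,                                   # assumed adapter rank
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],    # assumed target projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="retrobust-llama2-13b-sft",
    num_train_epochs=5,                     # reported value
    learning_rate=2e-4,                     # reported value
    lr_scheduler_type="linear",             # reported linear scheduling
    per_device_train_batch_size=4,          # assumption
    gradient_accumulation_steps=4,          # assumption
    bf16=True,
)
# Loading Llama-2-13B with bnb_config and running the trainer would follow the
# linked sft-llama-2-13b-guanaco-peft notebook.
```

Because the unreported values above are not confirmed, replicators would still need the authors' released code (linked under Open Source Code) to reproduce the exact setup.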