A Statistical Framework for Data-dependent Retrieval-Augmented Models

Authors: Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on an open-domain question answering task where retrieval augmentation is important. Our evaluation is based on two benchmark datasets: NQOpen (Kwiatkowski et al., 2019) and Trivia QA (Joshi et al., 2017), which serve as sources for supervised examples (x, y), while chunked Wikipedia 2018 is used as the data-store I, following the literature (Karpukhin et al., 2020a). Consistent with established practices, we employ the exact match metric to assess the correspondence between the predicted answers and the ground truth. Additionally, we introduce a recall metric to measure the frequency at which the answer string appears within the retrieved documents. Observation 1: The addition of a retrieval component markedly enhances performance, as demonstrated in Tables 1 and 2, which present the exact match accuracy. Further improvements are observed when the retriever is specifically trained while keeping the predictor fixed. Joint training emerges as the most effective strategy. (A hedged sketch of the exact match and recall metrics is given after this table.)
Researcher Affiliation | Industry | 1Google, New York, USA; 2Google Research, New York, USA; 3Google DeepMind, New York, USA. Correspondence to: Soumya Basu <basusoumya@google.com>.
Pseudocode | No | The paper describes methods using prose and mathematical equations but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper mentions using existing open-source models like GTR (Ni et al., 2022) and T5 (Raffel et al., 2020) but does not provide a link or explicit statement about releasing the source code for their own methodology or experiments.
Open Datasets | Yes | Our evaluation is based on two benchmark datasets: NQOpen (Kwiatkowski et al., 2019) and Trivia QA (Joshi et al., 2017), which serve as sources for supervised examples (x, y), while chunked Wikipedia 2018 is used as the data-store I, following the literature (Karpukhin et al., 2020a). Dataset: The versions of the open-domain QA datasets we use are: Trivia QA: https://www.tensorflow.org/datasets/catalog/trivia_qa#trivia_qaunfilterednocontext and NQOpen: https://www.tensorflow.org/datasets/catalog/natural_questions_open. (A hedged dataset-loading sketch is given after this table.)
Dataset Splits | No | The paper mentions the datasets used but does not explicitly provide details about the train/validation/test splits, such as percentages or specific sample counts for each split.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the ADAM optimizer and models like GTR and T5, but it does not specify version numbers for any software dependencies like Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | For all of our experiments, we use the ADAM weight decay optimizer with a short warm-up period (2,000 steps) and a linear decay schedule. We use a peak learning rate of 1 × 10⁻⁴. The weight decay factor is 0.1. We chose batch sizes to be 64. The number of total training steps is as follows: no retriever, train predictor ξ: 40,000; fixed retriever θ0, train predictor ξ: 20,000; fixed predictor ξ(θ0), train retriever θ: 20,000; jointly train predictor ξ and retriever θ: 40,000. (A hedged sketch of this learning-rate schedule is given after this table.)
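The exact match and recall metrics quoted in the Research Type row can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: the answer normalization (lowercasing, stripping articles and punctuation) follows common open-domain QA practice and is an assumption here.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace
    (common open-domain QA normalization; assumed, not the paper's exact recipe)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, answers: list[str]) -> bool:
    """Exact match: the normalized prediction equals some normalized gold answer."""
    return any(normalize(prediction) == normalize(a) for a in answers)

def retrieval_recall(retrieved_docs: list[str], answers: list[str]) -> bool:
    """Recall: some gold answer string appears within the retrieved documents."""
    return any(normalize(a) in normalize(doc)
               for doc in retrieved_docs for a in answers)

def evaluate(examples) -> dict:
    """Aggregate over (prediction, retrieved_docs, answers) triples."""
    em = sum(exact_match(p, a) for p, _, a in examples) / len(examples)
    rec = sum(retrieval_recall(d, a) for _, d, a in examples) / len(examples)
    return {"exact_match": em, "recall": rec}
```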
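The catalog URLs in the Open Datasets row point to TensorFlow Datasets builders. A hedged loading sketch follows; the builder names are inferred from those URLs, and the split choice is an assumption, since the paper does not state which splits were used.

```python
import tensorflow_datasets as tfds

# Builder names inferred from the TFDS catalog URLs cited above.
trivia_qa = tfds.load("trivia_qa/unfiltered.nocontext", split="train")
nq_open = tfds.load("natural_questions_open", split="train")

# Inspect one example from each dataset; the "train" split is an assumption.
for example in nq_open.take(1):
    print(sorted(example.keys()))
for example in trivia_qa.take(1):
    print(sorted(example.keys()))
```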
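The Experiment Setup row implies a learning-rate schedule with a 2,000-step linear warm-up to 1 × 10⁻⁴ followed by linear decay over the remaining steps. Below is a minimal, framework-agnostic sketch; decaying to zero at the final step is an assumption, since the paper only states the warm-up length, peak rate, weight decay, batch size, and total step counts.

```python
def learning_rate(step: int,
                  peak_lr: float = 1e-4,
                  warmup_steps: int = 2_000,
                  total_steps: int = 40_000) -> float:
    """Linear warm-up to peak_lr, then linear decay (end value of 0 is assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - frac)

# Total training steps reported for each configuration in the paper.
TOTAL_STEPS = {
    "no_retriever_train_predictor": 40_000,     # train predictor ξ only
    "fixed_retriever_train_predictor": 20_000,  # retriever θ0 fixed
    "fixed_predictor_train_retriever": 20_000,  # predictor ξ(θ0) fixed
    "joint_training": 40_000,                   # train ξ and θ jointly
}
```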