A Statistical Framework for Data-dependent Retrieval-Augmented Models

Authors: Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on an open-domain question answering task where retrieval augmentation is important. Our evaluation is based on two benchmark datasets: NQOpen (Kwiatkowski et al., 2019) and Trivia QA (Joshi et al., 2017), which serve as sources for supervised examples (x, y), while chunked Wikipedia 2018 is used as the data-store I, following the literature (Karpukhin et al., 2020a). Consistent with established practices, we employ the exact match metric to assess the correspondence between the predicted answers and the ground truth. Additionally, we introduce a recall metric to measure the frequency at which the answer string appears within the retrieved documents. Observation 1: The addition of a retrieval component markedly enhances performance, as demonstrated in Tables 1 and 2, which present the exact match accuracy. Further improvements are observed when the retriever is specifically trained while keeping the predictor fixed. Joint training emerges as the most effective strategy. (A hedged sketch of the exact match and recall metrics is given after this table.)
Researcher Affiliation | Industry | 1Google, New York, USA; 2Google Research, New York, USA; 3Google DeepMind, New York, USA. Correspondence to: Soumya Basu <basusoumya@google.com>.
Pseudocode | No | The paper describes methods using prose and mathematical equations but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper mentions using existing open-source models like GTR (Ni et al., 2022) and T5 (Raffel et al., 2020) but does not provide a link or explicit statement about releasing the source code for their own methodology or experiments.
Open Datasets | Yes | Our evaluation is based on two benchmark datasets: NQOpen (Kwiatkowski et al., 2019) and Trivia QA (Joshi et al., 2017), which serve as sources for supervised examples (x, y), while chunked Wikipedia 2018 is used as the data-store I, following the literature (Karpukhin et al., 2020a). Dataset: The versions of the open-domain QA datasets we use are: Trivia QA: https://www.tensorflow.org/datasets/catalog/trivia_qa#trivia_qaunfilterednocontext and NQOpen: https://www.tensorflow.org/datasets/catalog/natural_questions_open. (A hedged dataset-loading sketch is given after this table.)
Dataset Splits | No | The paper mentions the datasets used but does not explicitly provide details about the train/validation/test splits, such as percentages or specific sample counts for each split.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the ADAM optimizer and models like GTR and T5, but it does not specify version numbers for any software dependencies like Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | For all of our experiments, we use the ADAM weight decay optimizer with a short warm-up period (2,000 steps) and a linear decay schedule. We use a peak learning rate of 1 × 10⁻⁴. The weight decay factor is 0.1. We chose batch sizes to be 64. The number of total training steps is as follows: no retriever, train predictor ξ: 40,000; fixed retriever θ0, train predictor ξ: 20,000; fixed predictor ξ(θ0), train retriever θ: 20,000; jointly train predictor ξ and retriever θ: 40,000. (A hedged sketch of this learning-rate schedule is given after this table.)
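The exact match and recall metrics quoted in the Research Type row can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: the answer normalization (lowercasing, stripping articles and punctuation) follows common open-domain QA practice and is an assumption here.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace
    (common open-domain QA normalization; assumed, not the paper's exact recipe)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, answers: list[str]) -> bool:
    """Exact match: the normalized prediction equals some normalized gold answer."""
    return any(normalize(prediction) == normalize(a) for a in answers)

def retrieval_recall(retrieved_docs: list[str], answers: list[str]) -> bool:
    """Recall: some gold answer string appears within the retrieved documents."""
    return any(normalize(a) in normalize(doc)
               for doc in retrieved_docs for a in answers)

def evaluate(examples) -> dict:
    """Aggregate over (prediction, retrieved_docs, answers) triples."""
    em = sum(exact_match(p, a) for p, _, a in examples) / len(examples)
    rec = sum(retrieval_recall(d, a) for _, d, a in examples) / len(examples)
    return {"exact_match": em, "recall": rec}
```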
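The catalog URLs in the Open Datasets row point to TensorFlow Datasets builders. A hedged loading sketch follows; the builder names are inferred from those URLs, and the split choice is an assumption, since the paper does not state which splits were used.

```python
import tensorflow_datasets as tfds

# Builder names inferred from the TFDS catalog URLs cited above.
trivia_qa = tfds.load("trivia_qa/unfiltered.nocontext", split="train")
nq_open = tfds.load("natural_questions_open", split="train")

# Inspect one example from each dataset; the "train" split is an assumption.
for example in nq_open.take(1):
    print(sorted(example.keys()))
for example in trivia_qa.take(1):
    print(sorted(example.keys()))
```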
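The Experiment Setup row implies a learning-rate schedule with a 2,000-step linear warm-up to 1 × 10⁻⁴ followed by linear decay over the remaining steps. Below is a minimal, framework-agnostic sketch; decaying to zero at the final step is an assumption, since the paper only states the warm-up length, peak rate, weight decay, batch size, and total step counts.

```python
def learning_rate(step: int,
                  peak_lr: float = 1e-4,
                  warmup_steps: int = 2_000,
                  total_steps: int = 40_000) -> float:
    """Linear warm-up to peak_lr, then linear decay (end value of 0 is assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - frac)

# Total training steps reported for each configuration in the paper.
TOTAL_STEPS = {
    "no_retriever_train_predictor": 40_000,     # train predictor ξ only
    "fixed_retriever_train_predictor": 20_000,  # retriever θ0 fixed
    "fixed_predictor_train_retriever": 20_000,  # predictor ξ(θ0) fixed
    "joint_training": 40_000,                   # train ξ and θ jointly
}
```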