A Statistical Framework for Data-dependent Retrieval-Augmented Models
Authors: Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on the open-domain question answering task, where retrieval augmentation is important. Our evaluation is based on two benchmark datasets: NQOpen (Kwiatkowski et al., 2019) and Trivia QA (Joshi et al., 2017), which serve as sources for supervised examples (x, y), while chunked Wikipedia 2018 is used as the data-store I following the literature (Karpukhin et al., 2020a). Consistent with established practices, we employ the exact match metric to assess the correspondence between the predicted answers and the ground truth. Additionally, we introduce a recall metric to measure the frequency at which the answer string appears within the retrieved documents (see the metric sketch after this table). Observation 1: The addition of a retrieval component markedly enhances performance, as demonstrated in Tables 1 and 2, which present the exact match accuracy. Further improvements are observed when the retriever is specifically trained while keeping the predictor fixed. Joint training emerges as the most effective strategy. |
| Researcher Affiliation | Industry | 1Google, New York, USA 2Google Research, New York, USA 3Google Deep Mind, New York, USA. Correspondence to: Soumya Basu <basusoumya@google.com>. |
| Pseudocode | No | The paper describes methods using prose and mathematical equations but does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper mentions using existing open-source models like GTR (Ni et al., 2022) and T5 (Raffel et al., 2020) but does not provide a link or explicit statement about releasing the source code for their own methodology or experiments. |
| Open Datasets | Yes | Our evaluation is based on two benchmark datasets: NQOpen (Kwiatkowski et al., 2019) and Trivia QA (Joshi et al., 2017), which serve as sources for supervised examples (x, y), while chunked Wikipedia 2018 is used as the data-store I following the literature (Karpukhin et al., 2020a). Dataset: The versions of the open-domain QA datasets we use are: Trivia QA: https://www.tensorflow.org/datasets/catalog/trivia_qa#trivia_qaunfilterednocontext NQOpen: https://www.tensorflow.org/datasets/catalog/natural_questions_open |
| Dataset Splits | No | The paper mentions the datasets used but does not explicitly provide details about the train/validation/test splits, such as percentages or specific sample counts for each split. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using ADAM optimizer and models like GTR and T5, but it does not specify version numbers for any software dependencies like Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | For all of our experiments, we use the ADAM weight-decay (AdamW) optimizer with a short warm-up period (2,000 steps) and a linear decay schedule. We use a peak learning rate of 1 × 10^-4. The weight decay factor is 0.1. We chose batch sizes to be 64. The number of total training steps is as follows: no retriever, train predictor ξ: 40,000; fixed retriever θ0, train predictor ξ: 20,000; fixed predictor ξ(θ0), train retriever θ: 20,000; jointly train predictor ξ and retriever θ: 40,000 (see the optimizer sketch after this table). |
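The Research Type row quotes two evaluation metrics: exact match between the predicted answer and the ground truth, and a recall metric counting how often the answer string appears in the retrieved documents. The sketch below illustrates both; the text normalization (lowercasing, stripping punctuation and articles) and the function names `exact_match` / `retrieval_recall` are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the two QA metrics quoted in the table above.
# Normalization follows common open-domain QA practice (an assumption).
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Exact match: prediction equals any reference answer after normalization."""
    return any(normalize(prediction) == normalize(ans) for ans in gold_answers)


def retrieval_recall(retrieved_docs: list[str], gold_answers: list[str]) -> bool:
    """Recall metric: any gold answer string appears within the retrieved documents."""
    joined = " ".join(normalize(doc) for doc in retrieved_docs)
    return any(normalize(ans) in joined for ans in gold_answers)
```

Both functions return a per-example boolean; averaging them over the evaluation set gives the exact-match accuracy and retrieval recall reported in the paper's tables.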
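The Experiment Setup row fully specifies the optimization recipe (AdamW, 2,000-step linear warm-up, linear decay, peak learning rate 1e-4, weight decay 0.1, batch size 64). The paper does not name a training framework, so the following is a hypothetical PyTorch sketch of that schedule, with `build_optimizer_and_schedule` as an illustrative helper name.

```python
# Illustrative reconstruction of the reported optimizer and LR schedule.
# Framework choice (PyTorch) and helper name are assumptions.
import torch


def build_optimizer_and_schedule(model, total_steps, warmup_steps=2000,
                                 peak_lr=1e-4, weight_decay=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warm-up from 0 to the peak learning rate.
            return step / max(1, warmup_steps)
        # Linear decay from the peak learning rate down to 0 at total_steps.
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler


# Example: total_steps=40000 matches the joint predictor/retriever training run;
# the fixed-retriever and fixed-predictor runs use total_steps=20000, batch size 64.
```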