RA-DIT: Retrieval-Augmented Dual Instruction Tuning
Authors: Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Richard James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that each fine-tuning step offers significant performance gains, and that the fine-tuned LLM and retriever can be combined to achieve further improvements. Our largest model, RA-DIT 65B, attains state-of-the-art performance in zero- and few-shot settings on knowledge-intensive benchmarks, notably surpassing the un-tuned in-context RALM approach on datasets including MMLU (Hendrycks et al., 2021a) (+8.2% 0-shot; +0.7% 5-shot) and Natural Questions (Kwiatkowski et al., 2019) (+22% 0-shot; +3.8% 5-shot). |
| Researcher Affiliation | Industry | FAIR at Meta {victorialin,xilun,mingdachen,scottyih}@meta.com |
| Pseudocode | No | The paper describes algorithms and methods using natural language and mathematical equations but does not include explicit pseudocode blocks or sections labeled “Algorithm”. |
| Open Source Code | Yes | We release the scripts for indexing Common Crawl data and generating our fine-tuning and inference prompts at: https://github.com/facebookresearch/RA-DIT. |
| Open Datasets | Yes | We choose a set of fine-tuning tasks aimed at boosting the language model's ability to utilize knowledge effectively and improving its contextual awareness in generating predictions. As shown in Table 1, our language model fine-tuning datasets (DL) consist of 20 datasets across 5 distinct categories: dialogue, open-domain QA, reading comprehension, summarization and chain-of-thought reasoning. For the retriever fine-tuning datasets DR, we opt for the QA datasets in our collection featuring standalone questions, and we additionally include two QA datasets, Freebase QA (Jiang et al., 2019) and MS-MARCO (Nguyen et al., 2016). The examples of each dataset are serialized for instruction tuning using manually compiled templates (Table 10); a hypothetical serialization example is sketched after this table. For tasks in DL ∩ DR, we use the same template for both fine-tuning steps. In addition, we observe that supplementing the instruction-tuning data with unsupervised text leads to additional performance gains for both language model and retriever fine-tuning, and we detail the data mixture used in Appendix B. Table 1: Our instruction tuning datasets. All datasets are downloaded from Hugging Face (Lhoest et al., 2021), with the exception of the marked datasets, which are taken from Iyer et al. (2022). |
| Dataset Splits | Yes | We evaluate the models every 100 steps, and select the best checkpoint based on the average dev set performance over the 6 development KILT tasks shown in Table 11 (early stopping). Model validation is performed once every 500 steps using the same mean reciprocal rank (MRR) metric as in the original DRAGON paper (Lin et al., 2023), on a combined validation set from the 10-task MTI data; an MRR sketch follows this table. |
| Hardware Specification | Yes | We fine-tune the 7B, 13B and 65B LLAMA models using 8, 16 and 64 A100 GPUs, respectively. |
| Software Dependencies | No | The paper mentions several software components and models (e.g., LLAMA, DRAGON+, Hugging Face datasets, the dpr-scale codebase) but does not provide version numbers for core software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | Table 8: Hyperparameters for retrieval-augmented LM fine-tuning. Table 9: Hyperparameters for 64-shot fine-tuning on the eval tasks. Appendix B provides details on fine-tuning dataset selection, retrieval-augmented LM fine-tuning, and retriever fine-tuning. |
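
The Open Datasets row quotes the paper's description of serializing each example with manually compiled templates (Table 10 of the paper). Those templates are not reproduced here; the snippet below is a minimal sketch of what such serialization could look like for an open-domain QA example, where the template wording, field names, and the `serialize_qa_example` helper are assumptions rather than the authors' code.

```python
# Hypothetical instruction-tuning template for an open-domain QA example;
# the actual templates are listed in Table 10 of the RA-DIT paper.
QA_TEMPLATE = (
    "Background: {passage}\n\n"
    "Question: {question}\n"
    "Answer: {answer}"
)

def serialize_qa_example(passage: str, question: str, answer: str) -> str:
    """Render one QA example into a single instruction-tuning string."""
    return QA_TEMPLATE.format(passage=passage, question=question, answer=answer)

if __name__ == "__main__":
    print(serialize_qa_example(
        passage="The Eiffel Tower is located in Paris, France.",
        question="Where is the Eiffel Tower?",
        answer="Paris",
    ))
```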
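
The Dataset Splits row mentions retriever validation with the mean reciprocal rank (MRR) metric used in the DRAGON paper. As a reference for that metric only (not the authors' evaluation code), here is a small sketch that computes MRR from ranked retrieval results; the input layout is an assumption.

```python
# Sketch of mean reciprocal rank (MRR) over a retriever validation set.
# Input layout is an assumption: one ranked list of passage ids per query,
# plus the set of gold (relevant) passage ids for that query.
def mean_reciprocal_rank(ranked_ids_per_query, gold_ids_per_query):
    """Average over queries of 1/rank of the first relevant passage (0 if none)."""
    total = 0.0
    for ranked_ids, gold_ids in zip(ranked_ids_per_query, gold_ids_per_query):
        for rank, pid in enumerate(ranked_ids, start=1):
            if pid in gold_ids:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)

if __name__ == "__main__":
    ranked = [["p3", "p7", "p1"], ["p2", "p9"]]
    gold = [{"p7"}, {"p5"}]
    print(mean_reciprocal_rank(ranked, gold))  # (1/2 + 0) / 2 = 0.25
```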