InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining
Authors: Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on a wide range of zero-shot tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA and reading comprehension tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across 3 summarization tasks. |
| Researcher Affiliation | Collaboration | 1NVIDIA 2UIUC. |
| Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and checkpoints are publicly available at: https://huggingface.co/nvidia/retro-48b-instruct-4k. |
| Open Datasets | Yes | We prepared a pretraining dataset consisting of around 1.2 trillion tokens from English natural language data. Specifically, it consists of web-crawl data from Common Crawl, news data, conversational data, book data (e.g., Books3 and BookCorpus2 from the Pile dataset (Gao et al., 2020)), scientific and multi-domain data (e.g., Wikipedia and the Big Science ROOTS corpus (Laurençon et al., 2022)). |
| Dataset Splits | Yes | The validation corpus consists of 1% held-out samples from the pretraining corpus, which are not used in the pretraining stage, the continued pretraining stage, and the retrieval database to ensure that there is no validation data leakage. From Figure 2, one can see that after continued pretraining on additional 100 billion tokens, the perplexity of GPT-fitting slightly improves over original pretrained GPT, while Retro significantly outperforms both GPT and GPT-fitting across different parameter sizes in terms of perplexity. |
| Hardware Specification | Yes | As a result, we can achieve 4ms per query over the whole pretraining corpus averaged for each chunk on a DGX-A100 node. |
| Software Dependencies | Yes | We use the Faiss index (Johnson et al., 2019) as the implementation for the dense retriever to search for approximate nearest neighbors in the BERT embedding space. (An illustrative retrieval sketch follows the table.) |
| Experiment Setup | Yes | We finetune the LLMs by taking the loss only on the answer part with a batch size of 128 and a learning rate of 5e-6 for 1000 steps with a weight decay of 0.01. We use the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.98. We list the pretraining hyper-parameter details of Retro-fitting in Table 4. (An illustrative finetuning sketch follows the table.) |
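
The software-dependencies row quotes the use of a Faiss index for approximate nearest-neighbor search over BERT chunk embeddings. The sketch below is a minimal illustration of that setup, not the paper's configuration: the embedding dimension, index type (`IndexIVFFlat`), `nlist`/`nprobe` values, and the random toy corpus are all assumptions made for the example.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

# Illustrative assumptions: 768-dim BERT embeddings and a small random corpus.
# The paper retrieves over chunks of a ~1.2T-token corpus; toy data is used
# here only to show the Faiss API for approximate nearest-neighbor search.
d = 768                       # BERT embedding dimension (assumed)
n_chunks = 100_000            # number of database chunks (toy scale)
rng = np.random.default_rng(0)
xb = rng.standard_normal((n_chunks, d)).astype("float32")
faiss.normalize_L2(xb)        # cosine similarity via inner product on unit vectors

# IVF index for approximate search; nlist and nprobe are illustrative choices.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(xb)
index.add(xb)
index.nprobe = 16

# Query: embed the current chunk with the same BERT encoder, then search top-k.
xq = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(xq)
scores, neighbor_ids = index.search(xq, 2)   # 2 retrieved neighbors per chunk
print(neighbor_ids)
```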
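
The experiment-setup row states that the loss is taken only on the answer part, with batch size 128, learning rate 5e-6, 1000 steps, weight decay 0.01, and Adam with β1 = 0.9, β2 = 0.98. The PyTorch sketch below shows one way to implement answer-only loss masking with those hyper-parameters; `model`, `train_loader`, and the `answer_mask` field are placeholders assumed for illustration, not the authors' actual training code.

```python
import torch
import torch.nn.functional as F

# Hyper-parameters quoted in the paper; everything else below is a placeholder.
LR, WEIGHT_DECAY, STEPS, BATCH_SIZE = 5e-6, 0.01, 1000, 128

def answer_only_loss(logits, labels, answer_mask):
    """Cross-entropy over answer tokens only (prompt tokens are masked out).

    logits:      [batch, seq_len, vocab]
    labels:      [batch, seq_len] token ids
    answer_mask: [batch, seq_len] 1 where the token belongs to the answer
    """
    # Standard next-token shift: predict token t+1 from position t.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    shift_mask = answer_mask[:, 1:].float()

    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    # Average only over answer positions.
    return (loss * shift_mask.reshape(-1)).sum() / shift_mask.sum().clamp(min=1)

def finetune(model, train_loader, device="cuda"):
    # Adam with the quoted betas, learning rate, and weight decay.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=LR, betas=(0.9, 0.98), weight_decay=WEIGHT_DECAY
    )
    model.train()
    for step, batch in zip(range(STEPS), train_loader):
        input_ids = batch["input_ids"].to(device)      # [BATCH_SIZE, seq_len]
        answer_mask = batch["answer_mask"].to(device)  # 1 on answer tokens
        logits = model(input_ids).logits               # HF-style output assumed
        loss = answer_only_loss(logits, input_ids, answer_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```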