Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Authors: Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Scaling Retrieval-Based Language Models with a Trillion-Token Datastore. We carry out our study by constructing a 1.4 trillion-token datastore named MASSIVEDS... We systematically evaluate the effects of scaling MASSIVEDS on retrieval-based LMs with varying numbers of parameters and pretraining tokens (§ 4). Beyond upstream language modeling, we also consider a suite of diverse downstream tasks, including general-knowledge question answering (QA), domain-specialized QA, and reasoning tasks. We find that, first, datastore scaling consistently improves both language modeling and some downstream tasks... (Figure 1 Left). Second, since indexing a datastore is cheaper than training on the same amount of data, retrieval-based LMs enable better compute-optimal scaling trends... (Figure 1 Right). |
| Researcher Affiliation | Collaboration | Rulin Shao (1), Jacqueline He (1), Akari Asai (1), Weijia Shi (1), Tim Dettmers (1), Sewon Min (1), Luke Zettlemoyer (1), Pang Wei Koh (1,2); (1) University of Washington, (2) Allen Institute for AI; {rulins,jyyh,akari,swj0419,dettmers,sewon,lsz,pangwei}@cs.washington.edu |
| Pseudocode | Yes | Algorithm 1 Naive implementation of datastore scaling; Algorithm 2 Our efficient datastore scaling implementation |
| Open Source Code | Yes | Overall, our results show that datastore size should be considered as an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at https://github.com/RulinShao/retrieval-scaling. |
| Open Datasets | Yes | We first construct MASSIVEDS, a massively multi-domain datastore comprising 1.4 trillion tokens of both general web data and domain-specific data (Table 2). Domain-specific data comes from a mix of data-rich domains: books which span a variety of genres (Computer, 2023); open-access scientific papers (Lo et al., 2020; Soldaini & Lo, 2023; Computer, 2023); encyclopedic articles (Karpukhin et al., 2020; Computer, 2023); community questions and answers from Stack Exchange (Computer, 2023); code from GitHub (Computer, 2023); mathematical webpages (Paster et al., 2023) and mathematical language (Welleck et al., 2021); biomedical articles (National Library of Medicine, 2023). On the other hand, general web data is sourced from Common Crawl snapshots... and C4 (Raffel et al., 2020). |
| Dataset Splits | No | The paper defines 'evaluation' data for language modeling and downstream tasks, but it does not specify explicit training/validation/test splits with percentages or sample counts. It refers to established benchmarks and their inherent evaluation protocols (e.g., '5-shot prompting' for downstream tasks), but the paper itself does not detail how the primary datasets were split into these categories. |
| Hardware Specification | No | The paper mentions operating within a 'modest compute budget' and for 'an academic budget' but does not specify any particular hardware details such as GPU models, CPU types, or memory configurations used for the experiments. |
| Software Dependencies | No | The paper mentions various tools and models like 'LLAMA-2 tokenizer', 'CONTRIEVER-MSMARCO', 'FAISS', 'lm-evaluation-harness', 'MINI-LM-L12 V2', and 'DRAGON-ROBERTA'. However, it does not provide specific version numbers for these software components or libraries, which are necessary for reproducible dependency descriptions. |
| Experiment Setup | Yes | For evaluation with retrieval, we concatenate the top k = 3 documents in reverse order, so that higher-ranked documents are positioned closer to the query. For downstream tasks, we evaluate models via 5-shot prompting, and we prepend the retrieved documents before the few-shot examples, followed by the question. ... Our pipeline has three main hyper-parameters: k, K, and p. k is the number of documents used for evaluation. K is the number of documents retrieved before subsampling. p is the subsampling ratio which controls the size of the datastore. We consider k = 3, K = 1000, and p = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1]. We also set different random seeds for the subsampling process. We run each subsampling with three seeds (100, 101, 102) to obtain the confidence intervals in our scaling analyses. |
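
To make the experiment-setup row concrete, below is a minimal sketch of how the K/p/k pipeline it describes could be wired together: retrieve K = 1000 documents once, simulate smaller datastores by subsampling the retrieved set with ratio p under seeds 100-102, and prepend the top k = 3 documents in reverse order to the 5-shot prompt. This loosely mirrors the efficient datastore-scaling idea referenced in the Pseudocode row (subsample retrieved candidates rather than re-index the datastore), but the helper names (`subsample_retrieved`, `build_prompt`) and the toy documents standing in for a real dense-retrieval call are illustrative assumptions, not the authors' implementation; see their repository for the actual code.

```python
"""Minimal sketch (not the authors' code) of the datastore-scaling evaluation
described above: retrieve K documents once from the full index, emulate smaller
datastores by subsampling the retrieved set with ratio p under several seeds,
and build the retrieval-augmented 5-shot prompt with the top-k documents
concatenated in reverse order."""

import random

K = 1000                                    # documents retrieved before subsampling
k = 3                                       # documents used for evaluation
SUBSAMPLE_RATIOS = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0]
SEEDS = [100, 101, 102]                     # three seeds give the confidence intervals


def subsample_retrieved(top_K_docs, p, seed):
    """Keep each retrieved document with probability p, preserving score order,
    and return the top-k survivors (a stand-in for evaluating on a p-sized
    datastore without re-indexing it)."""
    rng = random.Random(seed)
    kept = [d for d in top_K_docs if rng.random() < p]
    return kept[:k]


def build_prompt(docs, few_shot_examples, question):
    """Prepend the retrieved documents in reverse order (higher-ranked documents
    closer to the query), followed by the 5-shot examples and the question."""
    blocks = [d["text"] for d in reversed(docs)] + list(few_shot_examples) + [question]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Toy stand-ins for a real dense-retrieval call and evaluation data.
    top_K_docs = [{"text": f"document {i}", "score": 1.0 / (i + 1)} for i in range(K)]
    few_shot = [f"Q: example {i}? A: answer {i}." for i in range(5)]
    question = "Q: what does datastore scaling change? A:"

    for p in SUBSAMPLE_RATIOS:
        for seed in SEEDS:
            docs = subsample_retrieved(top_K_docs, p, seed)
            prompt = build_prompt(docs, few_shot, question)
            # In the real pipeline the prompt would be scored by the LM and
            # results aggregated per (p, seed); here we just report doc counts.
            print(f"p={p:<5} seed={seed} retrieved_docs_used={len(docs)}")
```

The subsampling is applied to the already-retrieved top-K candidates, so the full index is queried only once per example; sweeping p then varies the effective datastore size cheaply, and the three seeds provide the confidence intervals reported in the scaling analyses.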