Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Authors: Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Scaling Retrieval-Based Language Models with a Trillion-Token Datastore. We carry out our study by constructing a 1.4 trillion-token datastore named MASSIVEDS... We systematically evaluate the effects of scaling MASSIVEDS on retrieval-based LMs with varying numbers of parameters and pretraining tokens (§4). Beyond upstream language modeling, we also consider a suite of diverse downstream tasks, including general-knowledge question answering (QA), domain-specialized QA, and reasoning tasks. We find that, first, datastore scaling consistently improves both language modeling and some downstream tasks... (Figure 1 Left). Second, since indexing a datastore is cheaper than training on the same amount of data, retrieval-based LMs enable better compute-optimal scaling trends... (Figure 1 Right).
Researcher Affiliation | Collaboration | Rulin Shao1 Jacqueline He1 Akari Asai1 Weijia Shi1 Tim Dettmers1 Sewon Min1 Luke Zettlemoyer1 Pang Wei Koh1,2 1University of Washington 2Allen Institute for AI {rulins,jyyh,akari,swj0419,dettmers,sewon,lsz,pangwei}@cs.washington.edu
Pseudocode | Yes | Algorithm 1 Naive implementation of datastore scaling; Algorithm 2 Our efficient datastore scaling implementation (a sketch contrasting the two appears after this table).
Open Source Code | Yes | Overall, our results show that datastore size should be considered as an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at https://github.com/RulinShao/retrieval-scaling.
Open Datasets | Yes | We first construct MASSIVEDS, a massively multi-domain datastore comprising 1.4 trillion tokens of both general web data and domain-specific data (Table 2). Domain-specific data comes from a mix of data-rich domains: books which span a variety of genres (Computer, 2023); open-access scientific papers (Lo et al., 2020; Soldaini & Lo, 2023; Computer, 2023); encyclopedic articles (Karpukhin et al., 2020; Computer, 2023); community questions and answers from Stack Exchange (Computer, 2023); code from GitHub (Computer, 2023); mathematical webpages (Paster et al., 2023) and mathematical language (Welleck et al., 2021); biomedical articles (of Medicine, 2023). On the other hand, general web data is sourced from Common Crawl snapshots... and C4 (Raffel et al., 2020).
Dataset Splits | No | The paper defines 'evaluation' data for language modeling and downstream tasks, but it does not specify explicit 'training/test/validation dataset splits' with percentages or sample counts. It refers to established benchmarks and their inherent evaluation protocols (e.g., '5-shot prompting' for downstream tasks), but the paper itself does not detail how the primary datasets were split into these categories.
Hardware Specification | No | The paper mentions operating within a 'modest compute budget' and for 'an academic budget' but does not specify any particular hardware details such as GPU models, CPU types, or memory configurations used for the experiments.
Software Dependencies | No | The paper mentions various tools and models like 'LLAMA-2 tokenizer', 'CONTRIEVER-MSMARCO', 'FAISS', 'lm-evaluation-harness', 'MINI-LM-L12 V2', and 'DRAGON-ROBERTA'. However, it does not provide specific version numbers for these software components or libraries, which are necessary for reproducible dependency descriptions.
Experiment Setup | Yes | For evaluation with retrieval, we concatenate the top k = 3 documents in reverse order, so that higher-ranked documents are positioned closer to the query. For downstream tasks, we evaluate models via 5-shot prompting, and we prepend the retrieved documents before the few-shot examples, followed by the question. ... Our pipeline has three main hyper-parameters: k, K, and p. k is the number of documents used for evaluation. K is the number of documents retrieved before subsampling. p is the subsampling ratio which controls the size of the datastore. We consider k = 3, K = 1000, and p = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1]. We also set different random seeds for the subsampling process. We run each subsampling with three seeds (100, 101, 102) to obtain the confidence intervals in our scaling analyses. (A sketch of this prompt layout follows below.)
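
On the Pseudocode row above: Algorithm 1 rebuilds a subsampled datastore (and its index) for every subsampling ratio p, whereas Algorithm 2 retrieves a deep top-K list from the full datastore once and then filters that list per ratio. Below is a minimal, self-contained Python sketch of that contrast. The toy lexical scorer and the names `score`, `retrieve`, `naive_scaling`, and `efficient_scaling` are illustrative assumptions, not the authors' implementation, which uses CONTRIEVER-MSMARCO embeddings with a FAISS index.

```python
import random

def score(query: str, doc: str) -> float:
    """Toy relevance score via token overlap (stand-in for a dense retriever)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(datastore: list[str], query: str, k: int) -> list[str]:
    """Exhaustive top-k search (stand-in for a FAISS index lookup)."""
    return sorted(datastore, key=lambda doc: score(query, doc), reverse=True)[:k]

def naive_scaling(docs, queries, ratios, k=3, seed=100):
    """Algorithm 1 style: re-subsample the datastore and search it anew for every p."""
    out = {}
    for p in ratios:
        rng = random.Random(seed)
        subsampled = [d for d in docs if rng.random() < p]         # p-fraction datastore
        out[p] = {q: retrieve(subsampled, q, k) for q in queries}  # costly: repeated per p
    return out

def efficient_scaling(docs, queries, ratios, k=3, K=1000, seed=100):
    """Algorithm 2 style: retrieve top-K from the full datastore once,
    then keep only retrieved documents that survive each subsampling."""
    top_k_full = {q: retrieve(docs, q, K) for q in queries}        # deep retrieval, done once
    out = {}
    for p in ratios:
        rng = random.Random(seed)
        keep = {d for d in docs if rng.random() < p}               # simulated p-sized datastore
        out[p] = {q: [d for d in ranked if d in keep][:k]
                  for q, ranked in top_k_full.items()}
    return out
```

Because both functions draw the subsample with the same seeded generator over the same document order, they select the same p-fraction datastore; the efficient variant simply avoids rebuilding an index per ratio, which is what makes sweeping p from 0.01 to 1 over a trillion-token datastore tractable.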
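
And on the Experiment Setup row: a hedged sketch of the described prompt layout, in which the k retrieved documents are concatenated in reverse rank order (so the rank-1 document sits closest to the query), followed by the 5-shot examples and then the question. The `build_prompt` helper and its delimiter strings are assumptions for illustration; the paper does not publish its exact template here.

```python
def build_prompt(retrieved_docs: list[str],
                 fewshot_examples: list[str],
                 question: str,
                 k: int = 3) -> str:
    """Assemble: [doc k ... doc 1] + few-shot demonstrations + question."""
    context = "\n\n".join(reversed(retrieved_docs[:k]))  # reverse order: rank 1 nearest the query
    shots = "\n\n".join(fewshot_examples)                # 5-shot demonstrations
    return f"{context}\n\n{shots}\n\nQuestion: {question}\nAnswer:"

# Example usage with placeholder inputs:
prompt = build_prompt(
    retrieved_docs=["doc ranked 1", "doc ranked 2", "doc ranked 3"],
    fewshot_examples=[f"Question: example {i}\nAnswer: ..." for i in range(5)],
    question="Which datastore does the paper construct?",
)
```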