Improving Language Models by Retrieving from Trillions of Tokens

Authors: Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, Laurent Sifre

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. A toy retrieval sketch follows the table.
Researcher Affiliation | Industry | All work done at DeepMind. Correspondence to: Sebastian Borgeaud <sborgeaud@deepmind.com>, Arthur Mensch <amensch@deepmind.com>, Jordan Hoffmann <jordanhoffmann@deepmind.com>, Laurent Sifre <sifre@deepmind.com>.
Pseudocode | Yes | Listing 1 contains a simplified implementation of CCA. Note that chunked cross-attention is autoregressive: the output of CCA at position i depends only on the tokens from 0 to i that are input to CCA. Algorithm 1 gives an overview of the RETRO model architecture. A simplified CCA sketch follows the table.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the methodology described, nor does it provide a direct link to a code repository for RETRO. It mentions: 'All models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).'
Open Datasets | Yes | We use a multi-lingual version of MassiveText (Rae et al., 2021) for both training and retrieval data. ... Additionally, we remove all validation and test articles from Wikitext103 (Merity et al., 2017) from our Wikipedia training data. ... We evaluate our models on C4 (Raffel et al., 2020), Wikitext103 (Merity et al., 2017), Curation Corpus (Curation, 2020), Lambada (Paperno et al., 2016) and the Pile (Gao et al., 2020).
Dataset Splits | Yes | Additionally, we remove all validation and test articles from Wikitext103 (Merity et al., 2017) from our Wikipedia training data. ... The baseline checkpoint at step 35,000 has the lowest perplexity on Wikitext103 valid, of 21.58, for an overlapping proportion of 75% (sliding-window evaluation that only uses probabilities for tokens that have at least 75% of the sequence length of context, when available). A sliding-window evaluation sketch follows the table.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or detailed machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions software components like JAX, Haiku, SCaNN, and SentencePiece, but does not provide specific version numbers for these dependencies. For example: 'All models are implemented using JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020).'
Experiment Setup | Yes | Hyperparameters are detailed in Table 7. All retrieval models use the same size encoder for the retrieval data, with d = 896 and 2 layers... The retrieval models contain one RETRO-block every 3 blocks, starting from layer 6. ... We train the BERT model for 500,000 steps with a batch size of 2,048 on the same data distribution and the same tokenizer as the baseline and retrieval models. ... and a learning rate of 1.25 × 10⁻³. A sketch of the RETRO-block placement follows the table.
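
Retrieval sketch (Research Type row). The abstract describes conditioning on chunks retrieved from a large corpus based on local similarity with the preceding tokens. Below is a minimal sketch of chunk-level nearest-neighbour lookup, assuming precomputed, L2-normalised chunk embeddings and brute-force cosine similarity; the function and variable names are illustrative, and the paper itself uses frozen BERT embeddings with an approximate SCaNN index over the 2 trillion token database.

import numpy as np

def retrieve_neighbours(chunk_emb, db_embs, db_chunks, k=2):
    # chunk_emb: (d,) embedding of the current input chunk.
    # db_embs: (N, d) L2-normalised embeddings of the database chunks.
    # db_chunks: list of N database chunks (each chunk plus its continuation).
    # Returns the k database chunks most similar to the input chunk.
    sims = db_embs @ chunk_emb                    # cosine similarity for unit vectors
    top = np.argsort(-sims)[:k]
    return [db_chunks[i] for i in top]

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
db_embs = rng.normal(size=(100, 8))
db_embs /= np.linalg.norm(db_embs, axis=1, keepdims=True)
db_chunks = [f"chunk-{i}" for i in range(100)]
print(retrieve_neighbours(db_embs[3], db_embs, db_chunks, k=2))

In the paper, the chunk embedding comes from the same frozen BERT model used to embed the database, and the returned neighbours (each chunk plus its continuation) are fed to the RETRO encoder.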
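
Chunked cross-attention sketch (Pseudocode row). The paper's Listing 1 gives a simplified JAX implementation of CCA; the snippet below is an even more reduced, single-head sketch intended only to show the autoregressive shift by m − 1 positions. It assumes the sequence length is a multiple of the chunk length m and that the retrieved neighbours are already encoded; names and shapes are illustrative, not the paper's code.

import jax
import jax.numpy as jnp

def attend(q, k, v):
    # q: (m, d) queries; k, v: (r, d) encoded neighbour tokens -> (m, d).
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def chunked_cross_attention(h, retrieved, m):
    # h: (n, d) hidden states, n = l * m tokens split into l chunks of length m.
    # retrieved: (l, r, d) encoded retrieval tokens for each chunk.
    # Attending positions are shifted by m - 1, so the output at position i depends
    # only on tokens 0..i; the first m - 1 tokens, which have no preceding chunk
    # to retrieve for, are passed through unchanged.
    n, d = h.shape
    l = n // m
    chunks = jnp.pad(h[m - 1:], ((0, m - 1), (0, 0))).reshape(l, m, d)
    out = jax.vmap(attend)(chunks, retrieved, retrieved).reshape(n, d)
    return jnp.concatenate([h[: m - 1], out[: n - (m - 1)]], axis=0)

# Toy shapes: 2 chunks of length 4, neighbours flattened to 12 retrieval tokens each.
h = jnp.ones((8, 16))
retrieved = jnp.ones((2, 12, 16))
print(chunked_cross_attention(h, retrieved, m=4).shape)  # (8, 16)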
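
Sliding-window evaluation sketch (Dataset Splits row). The quoted Wikitext103 perplexities use a sliding window in which only tokens with at least 75% of the sequence length of context are scored. Below is a hedged sketch of that scheme; log_probs_fn is a hypothetical model interface returning per-token log-probabilities for a window, and the 2,048-token sequence length is taken from the paper's training setup.

import numpy as np

def sliding_window_nll(token_ids, log_probs_fn, seq_len=2048, overlap=0.75):
    # Average negative log-likelihood per scored token; perplexity = exp(result).
    stride = max(1, int(seq_len * (1 - overlap)))   # e.g. 512 newly scored targets per window
    nll, count, prev_end = 0.0, 0, 0
    for start in range(0, len(token_ids) - 1, stride):
        end = min(start + seq_len, len(token_ids))
        # log_probs_fn returns log p(token[t] | tokens[start:t]) for t = start+1 .. end-1.
        lps = np.asarray(log_probs_fn(token_ids[start:end]))
        new_targets = end - prev_end   # targets not scored in a previous window; after the
        scored = lps[-new_targets:]    # first window they all have >= 75% of seq_len as context
        nll -= float(np.sum(scored))
        count += len(scored)
        prev_end = end
        if end == len(token_ids):
            break
    return nll / count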
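
RETRO-block placement sketch (Experiment Setup row). The setup places one RETRO block (a decoder block with chunked cross-attention) every 3 blocks starting from layer 6, alongside a 2-layer neighbour encoder with d = 896. The helper below sketches that placement rule; the 0-based layer indexing and the function name are assumptions, not the paper's code.

def retro_layer_indices(num_layers, start=6, every=3):
    # Layers that receive a RETRO block (cross-attention to retrieved neighbours):
    # start, start + every, start + 2 * every, ... up to num_layers - 1.
    return [i for i in range(num_layers) if i >= start and (i - start) % every == 0]

print(retro_layer_indices(12))  # [6, 9]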