REALM: Retrieval-Augmented Language Model Pre-Training

Authors: Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.
Researcher Affiliation | Industry | Kelvin Guu*, Kenton Lee*, Zora Tung, Panupong Pasupat, Ming-Wei Chang (Google Research). Correspondence to: Kelvin Guu <kguu@google.com>, Kenton Lee <kentonl@google.com>, Zora Tung <gatoatigrado@google.com>, Panupong Pasupat <ppasupat@google.com>, Ming-Wei Chang <mingweichang@google.com>.
Pseudocode | No | The paper describes the architecture and training process of REALM in detail, including its generative process, model architecture, and training procedures. However, it does not include an explicitly labeled 'Pseudocode' or 'Algorithm' block, nor a figure presenting the method as structured, code-like steps.
Open Source Code | No | The paper describes the REALM framework and its implementation details but does not provide an explicit statement about releasing the source code for the methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We evaluate on three popular Open-QA benchmarks (NATURALQUESTIONS-OPEN, WEBQUESTIONS, and CURATEDTREC)... The Natural Questions dataset (Kwiatkowski et al., 2019)... The Web Questions dataset (Berant et al., 2013)... The knowledge corpus is derived from the December 20, 2018 snapshot of English Wikipedia.
Dataset Splits | Yes | Table 1. Test results on Open-QA benchmarks. The number of train/test examples is shown in parentheses below each benchmark: NQ (79k/4k), WQ (3k/2k), CT (1k/1k). ... Table 2. Ablation experiments on NQ's development set.
Hardware Specification | Yes | We pre-train for 200k steps on 64 Google Cloud TPUs... The document embedding step for the MIPS index is parallelized over 16 TPUs. ...the entire model can be run on a single machine with a 12GB GPU.
Software Dependencies | No | The paper mentions using 'BERT's default optimizer' and 'BERT-style Transformers', but it does not specify exact version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries (e.g., scikit-learn, numpy).
Experiment Setup | Yes | We pre-train for 200k steps on 64 Google Cloud TPUs, with a batch size of 512 and a learning rate of 3e-5, using BERT's default optimizer. ...we increase the number of training epochs to 4, 60, and 80 for NaturalQuestions-Open, WebQuestions, and CuratedTrec respectively.
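
The Pseudocode row above notes that REALM's retrieve-then-predict generative process is described only in prose. A minimal sketch of that process, assuming the usual decomposition p(y|x) = sum_z p(y|z, x) p(z|x) over retrieved documents z, with retrieval scores given by inner products of query and document embeddings, is shown below; the function and argument names are illustrative assumptions, not the authors' code.

    import numpy as np

    def softmax(scores):
        """Numerically stable softmax over a 1-D score vector."""
        scores = scores - scores.max()
        exp = np.exp(scores)
        return exp / exp.sum()

    def realm_marginal_likelihood(query_emb, doc_embs, cond_likelihoods):
        """Sketch of REALM's retrieve-then-predict marginalization.

        query_emb:        (d,)   embedding of the input x
        doc_embs:         (k, d) embeddings of the top-k retrieved documents z
        cond_likelihoods: (k,)   p(y | z, x) for each retrieved document z

        Returns p(y | x) = sum_z p(y | z, x) * p(z | x), where p(z | x) is a
        softmax over inner-product relevance scores between x and z.
        """
        relevance = doc_embs @ query_emb      # relevance score f(x, z)
        p_z_given_x = softmax(relevance)      # retriever distribution over documents
        return float(np.dot(cond_likelihoods, p_z_given_x))

    # Hypothetical usage over 5 retrieved documents with 128-dim embeddings.
    rng = np.random.default_rng(0)
    p_y = realm_marginal_likelihood(rng.normal(size=128),
                                    rng.normal(size=(5, 128)),
                                    rng.uniform(size=5))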
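
The Open Datasets row lists NATURALQUESTIONS-OPEN, WEBQUESTIONS, and CURATEDTREC, but the paper does not prescribe how to obtain them. One convenient route for the first two, assuming the Hugging Face Hub ids "nq_open" and "web_questions" point to suitable mirrors (an assumption about third-party mirrors, not the authors' setup), would be:

    # Assumes the `datasets` library is installed; the Hub ids below are
    # assumed community mirrors of the benchmarks, not the authors' pipeline.
    from datasets import load_dataset

    nq_open = load_dataset("nq_open")              # NaturalQuestions-Open QA pairs
    web_questions = load_dataset("web_questions")  # WebQuestions QA pairs

    print(nq_open["train"][0])
    print(web_questions["train"][0])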
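
The Hardware Specification row mentions that the document embedding step for the MIPS index is parallelized over 16 TPUs and that inference fits on a single 12GB GPU. At its core, maximum inner product search over a pre-computed dense embedding matrix is a top-k inner-product lookup; the brute-force sketch below illustrates only that step (the paper's index is refreshed asynchronously during pre-training and uses approximate search at scale), with all names chosen for illustration.

    import numpy as np

    def mips_top_k(query_emb, doc_embs, k=5):
        """Brute-force maximum inner product search.

        query_emb: (d,)   query embedding
        doc_embs:  (N, d) pre-computed document embeddings
        Returns indices and scores of the k documents with the largest
        inner product with the query.
        """
        scores = doc_embs @ query_emb
        top = np.argpartition(-scores, k)[:k]    # unordered top-k indices
        top = top[np.argsort(-scores[top])]      # sort them by descending score
        return top, scores[top]

    # Hypothetical usage with random embeddings standing in for passage vectors.
    rng = np.random.default_rng(0)
    docs = rng.normal(size=(10_000, 128))
    query = rng.normal(size=128)
    idx, sc = mips_top_k(query, docs, k=5)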
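
The Experiment Setup row quotes the key hyperparameters. Collected into a single configuration sketch, only the quoted numbers come from the paper; the optimizer field reflects one common reading of "BERT's default optimizer" as Adam with weight decay and learning-rate warmup, which is an assumption here.

    # Configuration sketch of the reported setup. Steps, batch size, learning
    # rate, TPU count, and fine-tuning epochs are quoted from the paper; the
    # optimizer entry is an assumption, not a detail stated by the authors.
    PRETRAIN_CONFIG = {
        "num_steps": 200_000,
        "batch_size": 512,
        "learning_rate": 3e-5,
        "num_tpus": 64,
        "optimizer": "adam_with_weight_decay_and_warmup",  # assumed reading
    }

    FINETUNE_EPOCHS = {
        "NaturalQuestions-Open": 4,
        "WebQuestions": 60,
        "CuratedTrec": 80,
    }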