Retrieval-Augmented Language Model Pre-Training
Authors: Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of Retrieval Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity. |
| Researcher Affiliation | Industry | Kelvin Guu * 1 Kenton Lee * 1 Zora Tung 1 Panupong Pasupat 1 Ming-Wei Chang 1 1Google Research. Correspondence to: Kelvin Guu <kguu@google.com>, Kenton Lee <kentonl@google.com>, Zora Tung <gatoatigrado@google.com>, Panupong Pasupat <ppasupat@google.com>, Ming-Wei Chang <mingweichang@google.com>. |
| Pseudocode | No | The paper describes the architecture and training process of REALM in detail, including its generative process, model architecture, and training procedures. However, it does not include an explicitly labeled 'Pseudocode' or 'Algorithm' block or figure presenting structured, code-like steps (a minimal sketch of the retrieve-then-predict process described in the paper is given below the table). |
| Open Source Code | No | The paper describes the REALM framework and its implementation details but does not provide an explicit statement about releasing the source code for the methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | We evaluate on three popular Open-QA benchmarks (NATURALQUESTIONS-OPEN, WEBQUESTIONS, and CURATEDTREC)... The Natural Questions dataset (Kwiatkowski et al., 2019)... The Web Questions dataset (Berant et al., 2013)... The knowledge corpus is derived from the December 20, 2018 snapshot of English Wikipedia. |
| Dataset Splits | Yes | Table 1. Test results on Open-QA benchmarks. The number of train/test examples are shown in parentheses below each benchmark. NQ (79k/4k) WQ (3k/2k) CT (1k/1k). ... Table 2. Ablation experiments on NQ's development set. |
| Hardware Specification | Yes | We pre-train for 200k steps on 64 Google Cloud TPUs... The document embedding step for the MIPS index is parallelized over 16 TPUs. ...the entire model can be run on a single machine with a 12GB GPU. |
| Software Dependencies | No | The paper mentions using 'BERT's default optimizer' and 'BERT-style Transformers', but it does not specify exact version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries (e.g., scikit-learn, numpy). |
| Experiment Setup | Yes | We pre-train for 200k steps on 64 Google Cloud TPUs, with a batch size of 512 and a learning rate of 3e-5, using BERT's default optimizer. ...we increase the number of training epochs to 4, 60, and 80 for Natural Questions-Open, Web Questions, and CuratedTrec respectively. (These hyperparameters are collected into a configuration sketch below the table.) |
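Although the paper contains no pseudocode, its description of REALM's generative process is concrete: the model retrieves documents from a knowledge corpus and marginalizes the answer distribution over them, p(y|x) = Σ_z p(y|z, x) p(z|x), with retrieval approximated by keeping only the top-k documents under maximum inner product search (MIPS). The sketch below is a minimal, hypothetical NumPy illustration of that marginalization, not the authors' implementation; the function and variable names are invented, and treating p(y|x, z) as a precomputed matrix is a simplification of the paper's BERT-based reader.

```python
import numpy as np

def realm_answer_distribution(query_emb, doc_embs, doc_answer_probs, k=5):
    """Illustrative retrieve-then-predict marginalization,
    p(y|x) = sum_z p(y|x, z) p(z|x), restricted to the top-k documents.

    query_emb:        (d,) embedding of the input x
    doc_embs:         (num_docs, d) embeddings of the knowledge corpus
    doc_answer_probs: (num_docs, num_answers) assumed precomputed p(y|x, z)
    """
    # Relevance scores are inner products between query and document embeddings.
    scores = doc_embs @ query_emb                 # (num_docs,)

    # Approximate MIPS: keep only the k highest-scoring documents.
    top_k = np.argsort(-scores)[:k]
    top_scores = scores[top_k]

    # p(z|x) is a softmax over the retrieved documents' relevance scores.
    p_z_given_x = np.exp(top_scores - top_scores.max())
    p_z_given_x /= p_z_given_x.sum()

    # Marginalize the per-document answer distributions over retrieval.
    return p_z_given_x @ doc_answer_probs[top_k]  # (num_answers,)
```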
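For readability, the hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be gathered into a single configuration. This is a hedged sketch only: the values come from the quoted text, but the dictionary structure and key names are invented, not the authors' code.

```python
# Pre-training settings reported in the paper (key names are illustrative).
REALM_PRETRAIN_CONFIG = {
    "num_train_steps": 200_000,   # 200k pre-training steps
    "batch_size": 512,
    "learning_rate": 3e-5,        # with BERT's default optimizer
    "num_tpus": 64,               # Google Cloud TPUs used for pre-training
    "mips_index_tpus": 16,        # document embedding for the MIPS index
}

# Fine-tuning epochs per Open-QA benchmark, as quoted above.
FINETUNE_EPOCHS = {
    "NaturalQuestions-Open": 4,
    "WebQuestions": 60,
    "CuratedTrec": 80,
}
```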