REALM: Retrieval-Augmented Language Model Pre-Training

Authors: Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.
Researcher Affiliation | Industry | Kelvin Guu*, Kenton Lee*, Zora Tung, Panupong Pasupat, Ming-Wei Chang (Google Research). Correspondence to: Kelvin Guu <kguu@google.com>, Kenton Lee <kentonl@google.com>, Zora Tung <gatoatigrado@google.com>, Panupong Pasupat <ppasupat@google.com>, Ming-Wei Chang <mingweichang@google.com>.
Pseudocode | No | The paper describes the architecture and training process of REALM in detail, including its generative process, model architecture, and training procedures. However, it does not include an explicitly labeled 'Pseudocode' or 'Algorithm' block, nor a figure presenting the method as structured, code-like steps.
Open Source Code | No | The paper describes the REALM framework and its implementation details but does not provide an explicit statement about releasing the source code for the methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We evaluate on three popular Open-QA benchmarks (NATURALQUESTIONS-OPEN, WEBQUESTIONS, and CURATEDTREC)... The Natural Questions dataset (Kwiatkowski et al., 2019)... The Web Questions dataset (Berant et al., 2013)... The knowledge corpus is derived from the December 20, 2018 snapshot of English Wikipedia.
Dataset Splits | Yes | Table 1. Test results on Open-QA benchmarks. The number of train/test examples is shown in parentheses below each benchmark: NQ (79k/4k), WQ (3k/2k), CT (1k/1k). ... Table 2. Ablation experiments on NQ's development set.
Hardware Specification | Yes | We pre-train for 200k steps on 64 Google Cloud TPUs... The document embedding step for the MIPS index is parallelized over 16 TPUs. ...the entire model can be run on a single machine with a 12GB GPU.
Software Dependencies | No | The paper mentions using 'BERT's default optimizer' and 'BERT-style Transformers', but it does not specify exact version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries (e.g., scikit-learn, numpy).
Experiment Setup | Yes | We pre-train for 200k steps on 64 Google Cloud TPUs, with a batch size of 512 and a learning rate of 3e-5, using BERT's default optimizer. ...we increase the number of training epochs to 4, 60, and 80 for NaturalQuestions-Open, WebQuestions, and CuratedTrec respectively.
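
The Pseudocode row above notes that REALM's retrieve-then-predict generative process is described only in prose. A minimal sketch of that process, assuming the usual decomposition p(y|x) = sum_z p(y|z, x) p(z|x) over retrieved documents z, with retrieval scores given by inner products of query and document embeddings, is shown below; the function and argument names are illustrative assumptions, not the authors' code.

    import numpy as np

    def softmax(scores):
        """Numerically stable softmax over a 1-D score vector."""
        scores = scores - scores.max()
        exp = np.exp(scores)
        return exp / exp.sum()

    def realm_marginal_likelihood(query_emb, doc_embs, cond_likelihoods):
        """Sketch of REALM's retrieve-then-predict marginalization.

        query_emb:        (d,)   embedding of the input x
        doc_embs:         (k, d) embeddings of the top-k retrieved documents z
        cond_likelihoods: (k,)   p(y | z, x) for each retrieved document z

        Returns p(y | x) = sum_z p(y | z, x) * p(z | x), where p(z | x) is a
        softmax over inner-product relevance scores between x and z.
        """
        relevance = doc_embs @ query_emb      # relevance score f(x, z)
        p_z_given_x = softmax(relevance)      # retriever distribution over documents
        return float(np.dot(cond_likelihoods, p_z_given_x))

    # Hypothetical usage over 5 retrieved documents with 128-dim embeddings.
    rng = np.random.default_rng(0)
    p_y = realm_marginal_likelihood(rng.normal(size=128),
                                    rng.normal(size=(5, 128)),
                                    rng.uniform(size=5))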
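
The Open Datasets row lists NATURALQUESTIONS-OPEN, WEBQUESTIONS, and CURATEDTREC, but the paper does not prescribe how to obtain them. One convenient route for the first two, assuming the Hugging Face Hub ids "nq_open" and "web_questions" point to suitable mirrors (an assumption about third-party mirrors, not the authors' setup), would be:

    # Assumes the `datasets` library is installed; the Hub ids below are
    # assumed community mirrors of the benchmarks, not the authors' pipeline.
    from datasets import load_dataset

    nq_open = load_dataset("nq_open")              # NaturalQuestions-Open QA pairs
    web_questions = load_dataset("web_questions")  # WebQuestions QA pairs

    print(nq_open["train"][0])
    print(web_questions["train"][0])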
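
The Hardware Specification row mentions that the document embedding step for the MIPS index is parallelized over 16 TPUs and that inference fits on a single 12GB GPU. At its core, maximum inner product search over a pre-computed dense embedding matrix is a top-k inner-product lookup; the brute-force sketch below illustrates only that step (the paper's index is refreshed asynchronously during pre-training and uses approximate search at scale), with all names chosen for illustration.

    import numpy as np

    def mips_top_k(query_emb, doc_embs, k=5):
        """Brute-force maximum inner product search.

        query_emb: (d,)   query embedding
        doc_embs:  (N, d) pre-computed document embeddings
        Returns indices and scores of the k documents with the largest
        inner product with the query.
        """
        scores = doc_embs @ query_emb
        top = np.argpartition(-scores, k)[:k]    # unordered top-k indices
        top = top[np.argsort(-scores[top])]      # sort them by descending score
        return top, scores[top]

    # Hypothetical usage with random embeddings standing in for passage vectors.
    rng = np.random.default_rng(0)
    docs = rng.normal(size=(10_000, 128))
    query = rng.normal(size=128)
    idx, sc = mips_top_k(query, docs, k=5)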
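
The Experiment Setup row quotes the key hyperparameters. Collected into a single configuration sketch, only the quoted numbers come from the paper; the optimizer field reflects one common reading of "BERT's default optimizer" as Adam with weight decay and learning-rate warmup, which is an assumption here.

    # Configuration sketch of the reported setup. Steps, batch size, learning
    # rate, TPU count, and fine-tuning epochs are quoted from the paper; the
    # optimizer entry is an assumption, not a detail stated by the authors.
    PRETRAIN_CONFIG = {
        "num_steps": 200_000,
        "batch_size": 512,
        "learning_rate": 3e-5,
        "num_tpus": 64,
        "optimizer": "adam_with_weight_decay_and_warmup",  # assumed reading
    }

    FINETUNE_EPOCHS = {
        "NaturalQuestions-Open": 4,
        "WebQuestions": 60,
        "CuratedTrec": 80,
    }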