The MultiBERTs: BERT Reproductions for Robustness Analysis
Authors: Thibault Sellam, Steve Yadlowsky, Ian Tenney, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Raluca Turc, Jacob Eisenstein, Dipanjan Das, Ellie Pavlick
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with pre-trained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure which includes the architecture, training data, initialization scheme, and loss function. (Abstract) We present a case study to illustrate how MultiBERTs and the Multi-Bootstrap can help us draw more robust conclusions about model behavior. (Section 4) |
| Researcher Affiliation | Industry | {tsellam, yadlowsky, iftenney, epavlick}@google.com, Google Research |
| Pseudocode | Yes | Below, we present a simplified Python implementation of the Multi-Bootstrap algorithm presented in Section 3.2. It describes a single-sided version of the procedure, which could be used, e.g., to test that a model's performance is greater than 0. The input is a matrix of predictions where row indices correspond to test examples and column indices to random seeds. The function returns an array of nboot samples [θ̂_1, ..., θ̂_nboot]. (Appendix A) A sketch of this procedure appears below the table. |
| Open Source Code | Yes | We release our models and statistical library, along with an additional set of 140 intermediate checkpoints captured during pre-training to facilitate research on learning dynamics. (Abstract) Our checkpoints and statistical libraries are available at: http://goo.gle/multiberts. (Section 1) |
| Open Datasets | Yes | We trained the models on a combination of BooksCorpus (Zhu et al., 2015) and English Wikipedia. (Section 2.1) |
| Dataset Splits | Yes | We report results on the development sets of the GLUE tasks: CoLA (Warstadt et al., 2019), MNLI (matched) (Williams et al., 2018), MRPC (Dolan & Brockett, 2005), QNLI (v2) (Rajpurkar et al., 2016; Wang et al., 2019), QQP (Chen et al., 2018), RTE (Bentivogli et al., 2009), SST-2 (Socher et al., 2013), and STS-B (Cer et al., 2017). (Section 2.2 GLUE Setup) For each task, we fine-tune BERT for 3 epochs using a batch size of 32. We run a parameter sweep on learning rates [5e-5, 4e-5, 3e-5, 2e-5] and report the best score. (Section 2.2 GLUE Setup) |
| Hardware Specification | Yes | with each run taking about 4.5 days on 16 Cloud TPU v2 chips. (Section 2.1 Training Details) |
| Software Dependencies | Yes | We train using the same configuration as Devlin et al. (2019), with TensorFlow (Abadi et al., 2015) version 2.5 in v1 compatibility mode. (Section 2.1 Training Details) |
| Experiment Setup | Yes | We use the BERT-Base, Uncased architecture with 12 layers and embedding size 768. (Section 2.1 Overview) We release 25 models trained for two million steps each, each training step involving a batch of 256 sequences. (Section 2.1 Checkpoints) As in the original BERT paper, we used batch size 256 and the Adam optimizer (Kingma & Ba, 2014) with learning rate 1e-4 and 10,000 warm-up steps. We used the default values for all the other parameters, except the number of steps, which we set to two million, and sequence length, which we set to 512 from the beginning with up to 80 masked tokens per sequence. (Section 2.1 Training Details) For each task, we fine-tune BERT for 3 epochs using a batch size of 32. We run a parameter sweep on learning rates [5e-5, 4e-5, 3e-5, 2e-5] and report the best score. (Section 2.2 GLUE Setup) A sketch of this learning-rate sweep appears below the table. |
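
To make the pseudocode row concrete, here is a minimal sketch of a single-sided Multi-Bootstrap as described in the paper's Appendix A: the input is a matrix of predictions whose rows are test examples and whose columns are random seeds, and the function returns nboot bootstrap estimates of the mean. The function name `multibootstrap`, the `nboot` default, and the use of NumPy are assumptions made here for illustration; this is not the authors' released implementation.

```python
import numpy as np


def multibootstrap(predictions, nboot=1000, rng=None):
    """Single-sided Multi-Bootstrap sketch (assumed implementation).

    Args:
      predictions: array of shape (n_examples, n_seeds); row indices
        correspond to test examples, column indices to random seeds.
      nboot: number of bootstrap samples to draw.
      rng: optional numpy Generator for reproducibility.

    Returns:
      Array of nboot bootstrap estimates of the mean performance.
    """
    rng = rng or np.random.default_rng()
    n_examples, n_seeds = predictions.shape
    samples = np.empty(nboot)
    for b in range(nboot):
        # Resample both axes with replacement: test examples (rows)
        # and random seeds (columns).
        rows = rng.integers(0, n_examples, size=n_examples)
        cols = rng.integers(0, n_seeds, size=n_seeds)
        samples[b] = predictions[np.ix_(rows, cols)].mean()
    return samples
```

A one-sided test that performance is greater than 0, as mentioned in the quoted description, can then be read off as the fraction of bootstrap samples at or below zero, e.g. `p_value = (multibootstrap(predictions) <= 0).mean()`.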
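
The GLUE setup quoted above (3 epochs, batch size 32, a learning-rate sweep over [5e-5, 4e-5, 3e-5, 2e-5], best development score reported) amounts to a small grid search. The sketch below shows only that reporting logic; `fine_tune_and_eval` is a hypothetical callback standing in for whatever fine-tuning and evaluation routine is actually used.

```python
# Learning rates swept in the paper's GLUE setup (Section 2.2).
LEARNING_RATES = [5e-5, 4e-5, 3e-5, 2e-5]


def best_dev_score(checkpoint, task, fine_tune_and_eval):
    """Return the best dev-set score over the learning-rate sweep.

    `fine_tune_and_eval` is a hypothetical function that fine-tunes the
    given checkpoint on the task with the supplied hyperparameters and
    returns the development-set score.
    """
    scores = []
    for lr in LEARNING_RATES:
        score = fine_tune_and_eval(
            checkpoint, task, learning_rate=lr, num_epochs=3, batch_size=32)
        scores.append(score)
    return max(scores)
```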