The MultiBERTs: BERT Reproductions for Robustness Analysis
Authors: Thibault Sellam, Steve Yadlowsky, Ian Tenney, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Raluca Turc, Jacob Eisenstein, Dipanjan Das, Ellie Pavlick
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with pre-trained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure which includes the architecture, training data, initialization scheme, and loss function. (Abstract) We present a case study to illustrate how MultiBERTs and the Multi-Bootstrap can help us draw more robust conclusions about model behavior. (Section 4) |
| Researcher Affiliation | Industry | {tsellam, yadlowsky, iftenney, epavlick}@google.com, Google Research |
| Pseudocode | Yes | Below, we present a simplified Python implementation of the Multi-Bootstrap algorithm presented in Section 3.2. It describes a single-sided version of the procedure, which could be used, e.g., to test that a model's performance is greater than 0. The input is a matrix of predictions where row indices correspond to test examples and column indices to random seeds. The function returns an array of nboot samples [θ̂_1, ..., θ̂_nboot]. (Appendix A) A sketch of this procedure appears below the table. |
| Open Source Code | Yes | We release our models and statistical library, along with an additional set of 140 intermediate checkpoints captured during pre-training to facilitate research on learning dynamics. (Abstract) Our checkpoints and statistical libraries are available at: http://goo.gle/multiberts. (Section 1) |
| Open Datasets | Yes | We trained the models on a combination of BooksCorpus (Zhu et al., 2015) and English Wikipedia. (Section 2.1) |
| Dataset Splits | Yes | We report results on the development sets of the GLUE tasks: CoLA (Warstadt et al., 2019), MNLI (matched) (Williams et al., 2018), MRPC (Dolan & Brockett, 2005), QNLI (v2) (Rajpurkar et al., 2016; Wang et al., 2019), QQP (Chen et al., 2018), RTE (Bentivogli et al., 2009), SST-2 (Socher et al., 2013), and STS-B (Cer et al., 2017). (Section 2.2 GLUE Setup) For each task, we fine-tune BERT for 3 epochs using a batch size of 32. We run a parameter sweep on learning rates [5e-5, 4e-5, 3e-5, 2e-5] and report the best score. (Section 2.2 GLUE Setup) |
| Hardware Specification | Yes | with each run taking about 4.5 days on 16 Cloud TPU v2 chips. (Section 2.1 Training Details) |
| Software Dependencies | Yes | We train using the same configuration as Devlin et al. (2019), with TensorFlow (Abadi et al., 2015) version 2.5 in v1 compatibility mode. (Section 2.1 Training Details) |
| Experiment Setup | Yes | We use the BERT-Base, Uncased architecture with 12 layers and embedding size 768. (Section 2.1 Overview) We release 25 models trained for two million steps each, each training step involving a batch of 256 sequences. (Section 2.1 Checkpoints) As in the original BERT paper, we used batch size 256 and the Adam optimizer (Kingma & Ba, 2014) with learning rate 1e-4 and 10,000 warm-up steps. We used the default values for all the other parameters, except the number of steps, which we set to two million, and sequence length, which we set to 512 from the beginning with up to 80 masked tokens per sequence. (Section 2.1 Training Details) For each task, we fine-tune BERT for 3 epochs using a batch size of 32. We run a parameter sweep on learning rates [5e-5, 4e-5, 3e-5, 2e-5] and report the best score. (Section 2.2 GLUE Setup) A sketch of this learning-rate sweep appears below the table. |
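
To make the pseudocode row concrete, here is a minimal sketch of a single-sided Multi-Bootstrap as described in the paper's Appendix A: the input is a matrix of predictions whose rows are test examples and whose columns are random seeds, and the function returns nboot bootstrap estimates of the mean. The function name `multibootstrap`, the `nboot` default, and the use of NumPy are assumptions made here for illustration; this is not the authors' released implementation.

```python
import numpy as np


def multibootstrap(predictions, nboot=1000, rng=None):
    """Single-sided Multi-Bootstrap sketch (assumed implementation).

    Args:
      predictions: array of shape (n_examples, n_seeds); row indices
        correspond to test examples, column indices to random seeds.
      nboot: number of bootstrap samples to draw.
      rng: optional numpy Generator for reproducibility.

    Returns:
      Array of nboot bootstrap estimates of the mean performance.
    """
    rng = rng or np.random.default_rng()
    n_examples, n_seeds = predictions.shape
    samples = np.empty(nboot)
    for b in range(nboot):
        # Resample both axes with replacement: test examples (rows)
        # and random seeds (columns).
        rows = rng.integers(0, n_examples, size=n_examples)
        cols = rng.integers(0, n_seeds, size=n_seeds)
        samples[b] = predictions[np.ix_(rows, cols)].mean()
    return samples
```

A one-sided test that performance is greater than 0, as mentioned in the quoted description, can then be read off as the fraction of bootstrap samples at or below zero, e.g. `p_value = (multibootstrap(predictions) <= 0).mean()`.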
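
The GLUE setup quoted above (3 epochs, batch size 32, a learning-rate sweep over [5e-5, 4e-5, 3e-5, 2e-5], best development score reported) amounts to a small grid search. The sketch below shows only that reporting logic; `fine_tune_and_eval` is a hypothetical callback standing in for whatever fine-tuning and evaluation routine is actually used.

```python
# Learning rates swept in the paper's GLUE setup (Section 2.2).
LEARNING_RATES = [5e-5, 4e-5, 3e-5, 2e-5]


def best_dev_score(checkpoint, task, fine_tune_and_eval):
    """Return the best dev-set score over the learning-rate sweep.

    `fine_tune_and_eval` is a hypothetical function that fine-tunes the
    given checkpoint on the task with the supplied hyperparameters and
    returns the development-set score.
    """
    scores = []
    for lr in LEARNING_RATES:
        score = fine_tune_and_eval(
            checkpoint, task, learning_rate=lr, num_epochs=3, batch_size=32)
        scores.append(score)
    return max(scores)
```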