Rethinking Embedding Coupling in Pre-trained Language Models

Authors: Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, Sebastian Ruder

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 3 EXPERIMENTAL METHODOLOGY
Researcher Affiliation | Industry | Hyung Won Chung, Google Research, hwchung@google.com; Thibault Fevry, thibaultfevry@gmail.com; Henry Tsai, Google Research, henrytsai@google.com; Melvin Johnson, Google Research, melvinp@google.com; Sebastian Ruder, DeepMind, ruder@google.com
Pseudocode | No | The paper does not include any sections, figures, or blocks explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | We will release the pre-trained model checkpoint and the source code for RemBERT in order to promote reproducibility and share the pre-training cost with other researchers.
Open Datasets | Yes | For our experiments, we employ tasks from the XTREME benchmark (Hu et al., 2020) that require fine-tuning, including the XNLI (Conneau et al., 2018), NER (Pan et al., 2017), PAWS-X (Yang et al., 2019), XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2020), and TyDiQA-GoldP (Clark et al., 2020a) datasets. We train variants of this model that differ in certain hyper-parameters but otherwise are trained under the same conditions to ensure a fair comparison. The model is trained on Wikipedia dumps in 104 languages following Devlin et al. (2019) using masked language modeling (MLM). (A sketch of standard MLM masking appears after the table.)
Dataset Splits | Yes | We average results across three fine-tuning runs and evaluate on the dev sets unless otherwise stated. We show statistics for them in Table 11. Table 11: Statistics for the datasets in XTREME, including the number of training, development, and test examples as well as the number of languages for each task.
Hardware Specification | Yes | For all pre-training except for the large-scale RemBERT, we trained using 64 Google Cloud TPUs. All fine-tuning experiments were run on 8 Cloud TPUs.
Software Dependencies | No | The paper mentions using 'the SentencePiece tokenizer (Kudo & Richardson, 2018)' but does not give version numbers for software libraries or dependencies (e.g., Python, PyTorch, TensorFlow) beyond the SentencePiece citation.
Experiment Setup | Yes | For all fine-tuning experiments other than RemBERT, we use a batch size of 32. We sweep over the learning rate values specified in Table 10. Table 10: Fine-tuning hyperparameters for all models except RemBERT. Table 14: Hyperparameters for RemBERT architecture and pre-training. Table 15: Hyperparameters for RemBERT fine-tuning. (A sketch of this sweep appears after the table.)
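The Open Datasets row quotes the paper's statement that pre-training follows Devlin et al. (2019) with masked language modeling on Wikipedia dumps in 104 languages. As a reading aid, here is a minimal Python sketch of standard BERT-style MLM input corruption; the 15% masking rate, the 80/10/10 replacement split, and the MASK_TOKEN_ID and VOCAB_SIZE constants are the usual BERT defaults and placeholders, not values confirmed by this report.

```python
# Minimal sketch of BERT-style masked language modeling (MLM) input corruption,
# following Devlin et al. (2019), which the quoted setup says it follows.
# The 15% / 80-10-10 rates are standard BERT defaults, assumed here.
import random

MASK_TOKEN_ID = 103      # assumed [MASK] id; depends on the actual vocabulary
VOCAB_SIZE = 250_000     # placeholder; not a value stated in this report

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Return (corrupted_ids, labels); labels are -100 at positions the loss ignores."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:                        # 80%: replace with [MASK]
                corrupted.append(MASK_TOKEN_ID)
            elif roll < 0.9:                      # 10%: replace with a random token
                corrupted.append(rng.randrange(VOCAB_SIZE))
            else:                                 # 10%: keep the original token
                corrupted.append(tok)
        else:
            labels.append(-100)                   # sentinel for "not predicted"
            corrupted.append(tok)
    return corrupted, labels
```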
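The Experiment Setup row reports a fixed batch size of 32 and a sweep over the learning rates listed in the paper's Table 10, with results averaged over three runs on the dev sets (see Dataset Splits). The sketch below illustrates that sweep loop under stated assumptions: the learning-rate values are placeholders (Table 10 is not reproduced in this report), and finetune_and_evaluate is a hypothetical stand-in for the actual TPU fine-tuning job.

```python
# Minimal sketch of the fine-tuning sweep described above: fixed batch size 32,
# a grid of candidate learning rates (placeholders, NOT the paper's Table 10 values),
# and dev-set scores averaged over three runs per learning rate.
from statistics import mean

BATCH_SIZE = 32
CANDIDATE_LEARNING_RATES = [1e-5, 2e-5, 3e-5]   # placeholder grid, not Table 10
NUM_RUNS = 3                                     # results are averaged over 3 runs

def sweep(task, finetune_and_evaluate):
    """Return the best learning rate and its mean dev score for one XTREME task."""
    results = {}
    for lr in CANDIDATE_LEARNING_RATES:
        scores = [
            finetune_and_evaluate(task, batch_size=BATCH_SIZE,
                                  learning_rate=lr, seed=run)
            for run in range(NUM_RUNS)
        ]
        results[lr] = mean(scores)
    best_lr = max(results, key=results.get)
    return best_lr, results[best_lr]
```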