Rethinking Embedding Coupling in Pre-trained Language Models
Authors: Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, Sebastian Ruder
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 EXPERIMENTAL METHODOLOGY |
| Researcher Affiliation | Industry | Hyung Won Chung, Google Research, hwchung@google.com; Thibault Févry, thibaultfevry@gmail.com; Henry Tsai, Google Research, henrytsai@google.com; Melvin Johnson, Google Research, melvinp@google.com; Sebastian Ruder, DeepMind, ruder@google.com |
| Pseudocode | No | The paper does not include any sections, figures, or blocks explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | We will release the pre-trained model checkpoint and the source code for RemBERT in order to promote reproducibility and share the pre-training cost with other researchers. |
| Open Datasets | Yes | For our experiments, we employ tasks from the XTREME benchmark (Hu et al., 2020) that require fine-tuning, including the XNLI (Conneau et al., 2018), NER (Pan et al., 2017), PAWS-X (Yang et al., 2019), XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2020), and TyDiQA-GoldP (Clark et al., 2020a) datasets. We train variants of this model that differ in certain hyper-parameters but otherwise are trained under the same conditions to ensure a fair comparison. The model is trained on Wikipedia dumps in 104 languages following Devlin et al. (2019) using masked language modeling (MLM). |
| Dataset Splits | Yes | We average results across three fine-tuning runs and evaluate on the dev sets unless otherwise stated. We show statistics for them in Table 11. Table 11: Statistics for the datasets in XTREME, including the number of training, development, and test examples as well as the number of languages for each task. |
| Hardware Specification | Yes | For all pre-training except for the large-scale RemBERT, we trained using 64 Google Cloud TPUs. All fine-tuning experiments were run on 8 Cloud TPUs. |
| Software Dependencies | No | The paper mentions using 'the SentencePiece tokenizer (Kudo & Richardson, 2018)' but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or TensorFlow beyond the citation for SentencePiece. |
| Experiment Setup | Yes | For all fine-tuning experiments other than RemBERT, we use a batch size of 32. We sweep over the learning rate values specified in Table 10. Table 10: Fine-tuning hyperparameters for all models except RemBERT. Table 14: Hyperparameters for RemBERT architecture and pre-training. Table 15: Hyperparameters for RemBERT fine-tuning. (A hedged sketch of this sweep-and-average protocol follows the table.) |
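
The Experiment Setup and Dataset Splits rows describe a simple protocol: fine-tune with a batch size of 32, sweep over a grid of learning rates, run three fine-tuning seeds per setting, and report the dev-set average. The sketch below illustrates that loop in Python; `fine_tune_and_eval`, the candidate learning-rate grid, and the dummy scores are placeholders of ours, not the authors' released code or their Table 10 values.

```python
"""Minimal sketch of the sweep-and-average fine-tuning protocol described above.

Assumptions: the learning-rate grid and the scoring stub are hypothetical;
a real run would fine-tune the pre-trained checkpoint on an XTREME task
and return the dev-set metric (e.g. accuracy or F1).
"""

import statistics

CANDIDATE_LRS = [1e-5, 3e-5, 5e-5]  # hypothetical grid; the paper's exact values are in its Table 10
BATCH_SIZE = 32                     # stated in the paper for all non-RemBERT fine-tuning
NUM_SEEDS = 3                       # results are averaged across three fine-tuning runs


def fine_tune_and_eval(learning_rate: float, seed: int) -> float:
    """Placeholder for a single fine-tuning run plus dev-set evaluation."""
    # Deterministic dummy score so the sketch runs end to end without TPUs.
    return 80.0 + learning_rate * 1e4 + seed * 0.1


def sweep() -> dict[float, float]:
    """Return the mean dev-set score per learning rate, averaged over seeds."""
    results = {}
    for lr in CANDIDATE_LRS:
        scores = [fine_tune_and_eval(lr, seed) for seed in range(NUM_SEEDS)]
        results[lr] = statistics.mean(scores)
    return results


if __name__ == "__main__":
    for lr, mean_score in sorted(sweep().items()):
        print(f"lr={lr:.0e}  batch_size={BATCH_SIZE}  mean dev score={mean_score:.2f}")
```

In this reading, the best learning rate per task would be picked from the averaged dev-set scores, which matches the paper's statement that results are averaged across three fine-tuning runs and evaluated on the dev sets unless otherwise stated.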