BASE Layers: Simplifying Training of Large, Sparse Models
Authors: Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with models of up to 110B parameters demonstrate large performance gains over standard data and model parallel training strategies. |
| Researcher Affiliation | Collaboration | Facebook AI Research; University of Washington. |
| Pseudocode | Yes | Figure 2 shows overall pseudocode for the approach (a rough sketch of the routing it describes appears below the table). |
| Open Source Code | Yes | Code is publicly released: https://github.com/pytorch/fairseq/ |
| Open Datasets | Yes | We train on a corpus of approximately 100B tokens, comprising the training corpus of RoBERTa (Liu et al., 2019), combined with the English portion of the CC100 corpus (Conneau et al., 2019). |
| Dataset Splits | No | While validation perplexity is shown in figures, the paper does not provide specific percentages or counts for the validation dataset split needed to reproduce the experiment. |
| Hardware Specification | Yes | Unless otherwise stated, models are trained on 128 32GB V100 GPUs connected with Infiniband. |
| Software Dependencies | No | The paper mentions software components like the Adam optimizer, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We train all models for approximately 2.5 days. All models use similar hyperparameters of 2000 warm-up steps, and the Adam optimizer (Kingma & Ba, 2014). We tune learning rates for each model separately, and linearly decay the learning rate during training. Each worker processes two sequences of length 1024, and gradients are accumulated over 8 updates. We clip gradients if their l2 norm exceeds 0.1 (§3). Learning rates are tuned in the range {0.5, 0.75, 1.0} × 10⁻⁴, taking the highest value that avoids divergence. A minimal sketch of this training recipe appears below the table. |
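
The paper's Figure 2 pseudocode is not reproduced in this summary. As a rough illustration of the balanced token-to-expert routing that BASE layers are built around, the sketch below assigns an equal number of tokens to each expert by maximising token-expert affinity under a hard load-balance constraint. It uses SciPy's Hungarian solver rather than the paper's auction algorithm, and `balanced_assignment`, the random scores, and the expert count are illustrative placeholders, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assignment(scores: np.ndarray, num_experts: int) -> np.ndarray:
    """Assign each token to one expert so every expert receives the same load.

    scores: (num_tokens, num_experts) token-expert affinity scores.
    Returns an array of expert indices, one per token.
    """
    num_tokens = scores.shape[0]
    assert num_tokens % num_experts == 0, "tokens must divide evenly among experts"
    capacity = num_tokens // num_experts
    # Expand each expert into `capacity` slots so the assignment problem is square.
    expanded = np.repeat(scores, capacity, axis=1)        # (num_tokens, num_tokens)
    # Maximise total affinity under the one-token-per-slot constraint.
    row_ind, col_ind = linear_sum_assignment(expanded, maximize=True)
    return col_ind[np.argsort(row_ind)] // capacity       # slot index -> expert index

# Example: route 8 tokens to 4 experts, 2 tokens per expert.
rng = np.random.default_rng(0)
print(balanced_assignment(rng.standard_normal((8, 4)), num_experts=4))
```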
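
The quoted experiment setup translates fairly directly into optimizer and scheduler code. The sketch below is a minimal PyTorch rendering of that recipe, not the authors' fairseq configuration: Adam, 2000 linear warm-up steps, linear decay of the learning rate, gradient clipping at an l2 norm of 0.1, and a peak learning rate from the {0.5, 0.75, 1.0} × 10⁻⁴ sweep. The model, total step count, and dummy batches are placeholders, and gradient accumulation and distributed training are omitted.

```python
import torch

# Placeholder model: the real runs train large (sparse) language models in fairseq.
model = torch.nn.Linear(1024, 1024)
peak_lr = 1.0e-4                              # tuned over {0.5, 0.75, 1.0} * 1e-4
warmup_steps, total_steps = 2000, 100_000     # total_steps is illustrative

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    """2000-step linear warm-up, then linear decay of the learning rate."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(10):                        # stand-in loop; a real run covers total_steps
    loss = model(torch.randn(2, 1024)).pow(2).mean()     # dummy batch and loss
    loss.backward()
    # Clip gradients whose l2 norm exceeds 0.1, as in the quoted setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```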