BASE Layers: Simplifying Training of Large, Sparse Models

Authors: Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer

ICML 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with models of up to 110B parameters demonstrate large performance gains over standard data and model parallel training strategies. |
| Researcher Affiliation | Collaboration | Facebook AI Research; University of Washington. |
| Pseudocode | Yes | Figure 2 shows overall pseudocode for the approach. |
| Open Source Code | Yes | Code is publicly released at https://github.com/pytorch/fairseq/ |
| Open Datasets | Yes | We train on a corpus of approximately 100B tokens, comprising the training corpus of RoBERTa (Liu et al., 2019), combined with the English portion of the CC100 corpus (Conneau et al., 2019). |
| Dataset Splits | No | While validation perplexity is shown in figures, the paper does not provide specific percentages or counts for the validation split needed to reproduce the experiment. |
| Hardware Specification | Yes | Unless otherwise stated, models are trained on 128 32GB V100 GPUs connected with Infiniband. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, but does not provide version numbers for any software dependencies. |
| Experiment Setup | Yes | We train all models for approximately 2.5 days. All models use similar hyperparameters of 2000 warm-up steps, and the Adam optimizer (Kingma & Ba, 2014). We tune learning rates for each model separately, and linearly decay the learning rate during training. Each worker processes two sequences of length 1024, and gradients are accumulated over 8 updates. We clip gradients if their l2 norm exceeds 0.1 (§3). Learning rates are tuned in the range {0.5, 0.75, 1.0} × 10⁻⁴, taking the highest value that avoids divergence. |
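For concreteness, below is a minimal PyTorch sketch of the optimization settings reported in the Experiment Setup row (2000 warm-up steps, Adam, linear learning-rate decay, gradient accumulation over 8 updates, l2 gradient clipping at 0.1, peak learning rate from {0.5, 0.75, 1.0} × 10⁻⁴). The model, loss, and total update count are placeholder assumptions; the paper trains large sparse language models with fairseq's distributed setup, not this toy loop.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_UPDATES = 60_000   # assumption: the paper reports ~2.5 days of training, not a step count
WARMUP_UPDATES = 2_000   # warm-up steps reported in the paper
PEAK_LR = 1e-4           # highest value in the tuned range {0.5, 0.75, 1.0} x 1e-4
ACCUM_STEPS = 8          # gradients accumulated over 8 updates
CLIP_NORM = 0.1          # l2 gradient clipping threshold

model = torch.nn.Linear(1024, 1024)   # placeholder for the actual sparse language model
optimizer = Adam(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    """Linear warm-up for WARMUP_UPDATES steps, then linear decay to zero."""
    if step < WARMUP_UPDATES:
        return step / WARMUP_UPDATES
    return max(0.0, (TOTAL_UPDATES - step) / (TOTAL_UPDATES - WARMUP_UPDATES))

scheduler = LambdaLR(optimizer, lr_lambda)

def training_step(micro_batches):
    """One optimizer update over ACCUM_STEPS micro-batches.

    Each worker processes two sequences of length 1024 per micro-batch,
    so `micro_batches` would be ACCUM_STEPS tensors of shape (2, 1024).
    """
    optimizer.zero_grad()
    for x in micro_batches:
        loss = model(x).mean()            # placeholder loss for illustration
        (loss / ACCUM_STEPS).backward()   # average gradients across accumulation steps
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimizer.step()
    scheduler.step()
```

This is only a sketch under the stated assumptions; reproducing the paper's results would additionally require its BASE layer implementation and the 128-GPU data/model-parallel configuration described above.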