BASE Layers: Simplifying Training of Large, Sparse Models
Authors: Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with models of up to 110B parameters demonstrate large performance gains over standard data and model parallel training strategies. |
| Researcher Affiliation | Collaboration | Facebook AI Research; University of Washington. |
| Pseudocode | Yes | Figure 2 shows overall pseudocode for the approach (a rough sketch of the routing it describes appears below the table). |
| Open Source Code | Yes | Code is publicly released: https://github.com/pytorch/fairseq/ |
| Open Datasets | Yes | We train on a corpus of approximately 100B tokens, comprising the training corpus of RoBERTa (Liu et al., 2019), combined with the English portion of the CC100 corpus (Conneau et al., 2019). |
| Dataset Splits | No | While validation perplexity is shown in figures, the paper does not provide specific percentages or counts for the validation dataset split needed to reproduce the experiment. |
| Hardware Specification | Yes | Unless otherwise stated, models are trained on 128 32GB V100 GPUs connected with Infiniband. |
| Software Dependencies | No | The paper mentions software components like the Adam optimizer, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We train all models for approximately 2.5 days. All models use similar hyperparameters of 2000 warm-up steps, and the Adam optimizer (Kingma & Ba, 2014). We tune learning rates for each model separately, and linearly decay the learning rate during training. Each worker processes two sequences of length 1024, and gradients are accumulated over 8 updates. We clip gradients if their l2 norm exceeds 0.1 (§3). Learning rates are tuned in the range {0.5, 0.75, 1.0} × 10⁻⁴, taking the highest value that avoids divergence. A minimal sketch of this training recipe appears below the table. |
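
The paper's Figure 2 pseudocode is not reproduced in this summary. As a rough illustration of the balanced token-to-expert routing that BASE layers are built around, the sketch below assigns an equal number of tokens to each expert by maximising token-expert affinity under a hard load-balance constraint. It uses SciPy's Hungarian solver rather than the paper's auction algorithm, and `balanced_assignment`, the random scores, and the expert count are illustrative placeholders, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assignment(scores: np.ndarray, num_experts: int) -> np.ndarray:
    """Assign each token to one expert so every expert receives the same load.

    scores: (num_tokens, num_experts) token-expert affinity scores.
    Returns an array of expert indices, one per token.
    """
    num_tokens = scores.shape[0]
    assert num_tokens % num_experts == 0, "tokens must divide evenly among experts"
    capacity = num_tokens // num_experts
    # Expand each expert into `capacity` slots so the assignment problem is square.
    expanded = np.repeat(scores, capacity, axis=1)        # (num_tokens, num_tokens)
    # Maximise total affinity under the one-token-per-slot constraint.
    row_ind, col_ind = linear_sum_assignment(expanded, maximize=True)
    return col_ind[np.argsort(row_ind)] // capacity       # slot index -> expert index

# Example: route 8 tokens to 4 experts, 2 tokens per expert.
rng = np.random.default_rng(0)
print(balanced_assignment(rng.standard_normal((8, 4)), num_experts=4))
```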
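
The quoted experiment setup translates fairly directly into optimizer and scheduler code. The sketch below is a minimal PyTorch rendering of that recipe, not the authors' fairseq configuration: Adam, 2000 linear warm-up steps, linear decay of the learning rate, gradient clipping at an l2 norm of 0.1, and a peak learning rate from the {0.5, 0.75, 1.0} × 10⁻⁴ sweep. The model, total step count, and dummy batches are placeholders, and gradient accumulation and distributed training are omitted.

```python
import torch

# Placeholder model: the real runs train large (sparse) language models in fairseq.
model = torch.nn.Linear(1024, 1024)
peak_lr = 1.0e-4                              # tuned over {0.5, 0.75, 1.0} * 1e-4
warmup_steps, total_steps = 2000, 100_000     # total_steps is illustrative

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    """2000-step linear warm-up, then linear decay of the learning rate."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(10):                        # stand-in loop; a real run covers total_steps
    loss = model(torch.randn(2, 1024)).pow(2).mean()     # dummy batch and loss
    loss.backward()
    # Clip gradients whose l2 norm exceeds 0.1, as in the quoted setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```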