Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
BASE Layers: Simplifying Training of Large, Sparse Models
Authors: Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with models of up to 110B parameters demonstrate large performance gains over standard data and model parallel training strategies. |
| Researcher Affiliation | Collaboration | 1Facebook AI Research 2University of Washington. |
| Pseudocode | Yes | Figure 2 shows overall pseudo code for the approach. |
| Open Source Code | Yes | Code is publicly released. (Footnote 1: https://github.com/pytorch/fairseq/) |
| Open Datasets | Yes | We train on a corpus of approximately 100B tokens, comprising the training corpus of RoBERTa (Liu et al., 2019), combined with the English portion of the CC100 corpus (Conneau et al., 2019). |
| Dataset Splits | No | While validation perplexity is shown in figures, the paper does not provide specific percentages or counts for the validation dataset split needed to reproduce the experiment. |
| Hardware Specification | Yes | Unless otherwise stated, models are trained on 128 32GB V100 GPUs connected with Infiniband. |
| Software Dependencies | No | The paper mentions software components like the Adam optimizer, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We train all models for approximately 2.5 days. All models use similar hyperparameters of 2000 warm-up steps, and the Adam optimizer (Kingma & Ba, 2014). We tune learning rates for each model separately, and linearly decay the learning rate during training. Each worker processes two sequences of length 1024, and gradients are accumulated over 8 updates. We clip gradients if their l2 norm exceeds 0.1 (§3). Learning rates are tuned in the range {0.5, 0.75, 1.0} × 10⁻⁴, taking the highest value that avoids divergence. |
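
The Experiment Setup row can be turned into a small arithmetic sketch. The Python below is illustrative only and is not taken from the paper or from fairseq: the function names are hypothetical, the warm-up is assumed to be linear (the row states only the step count), `total_steps` is a placeholder because the row gives a wall-clock budget of about 2.5 days rather than a step count, and the token count assumes all 128 GPUs act as workers that each contribute two sequences per accumulation step.

```python
def tokens_per_update(num_workers=128, seqs_per_worker=2, seq_len=1024, accum_steps=8):
    """Tokens consumed per optimizer update, assuming every GPU is a worker."""
    return num_workers * seqs_per_worker * seq_len * accum_steps


def learning_rate(step, peak_lr, total_steps, warmup_steps=2000):
    """Warm-up followed by linear decay to zero.

    The linear warm-up shape and `total_steps` are assumptions; the quoted
    setup specifies only 2000 warm-up steps and a linear decay.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


def clip_scale(grad_l2_norm, max_norm=0.1):
    """Factor by which gradients are scaled when their l2 norm exceeds max_norm."""
    if grad_l2_norm <= max_norm:
        return 1.0
    return max_norm / grad_l2_norm


if __name__ == "__main__":
    print(tokens_per_update())
    # Candidate peak learning rates from the quoted tuning range.
    for peak in (0.5e-4, 0.75e-4, 1.0e-4):
        print(peak, learning_rate(step=1000, peak_lr=peak, total_steps=10_000))
```

Under these assumptions each optimizer update covers 128 × 2 × 1024 × 8 = 2,097,152 tokens. The `total_steps=10_000` value in the usage lines is arbitrary and serves only to exercise the schedule.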