Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

Authors: Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, Neil Houlsby

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose sparse upcycling, a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only 50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
Researcher Affiliation | Collaboration | Aran Komatsuzaki (Georgia Institute of Technology); Joan Puigcerver, James Lee-Thorp, Carlos Riquelme, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby (Google Research).
Pseudocode | Yes | Figure 1: The upcycling initialization process. All parameters, and optionally their optimizer state, are copied from the original checkpoint, except those corresponding to the MoE router, which does not exist in the original architecture. In particular, the experts in the new MoE layer are identical copies of the original MLP layer that is replaced. (See the initialization sketch after the table.)
Open Source Code | Yes | Code is available at https://github.com/google-research/vmoe (Vision) and https://github.com/google-research/t5x/tree/main/t5x/contrib/moe (Language).
Open Datasets | Yes | Upstream pretraining is done on JFT-300M (Sun et al., 2017)... Our language experiments follow the setup of Raffel et al. (2020): we pretrain using the span corruption task on the English C4 dataset (Raffel et al., 2020).
Dataset Splits | Yes | Upstream pretraining is done on JFT-300M (Sun et al., 2017), with validation metrics computed on a held-out set of 894,574 examples.
Hardware Specification | Yes | Upcycling was performed on TPU v4 accelerators using 64 chips for Base and Large and 256 chips for XL.
Software Dependencies | No | The paper names specific optimizers and checkpoints, such as 'Adafactor (Shazeer & Stern, 2018)' and 'T5 1.1 checkpoints (Narang et al., 2021; Roberts et al., 2022)', but does not provide version numbers for general software dependencies such as programming languages or deep learning frameworks (e.g., Python or TensorFlow/PyTorch versions).
Experiment Setup | Yes | We use the original hyperparameters: same batch size, learning rate schedule, and weight decay leading to the original checkpoint; see also Appendix A for full training details. We train with Adafactor (Shazeer & Stern, 2018) and decoupled weight decay (magnitude 3 on head and 0.03 on body), following Zhai et al. (2022). We use a batch size of 4096. The learning rate schedule consists of a learning rate warmup of 10 000 steps, followed by reverse square root decay with timescale 100 000 steps and ending with a linear cooldown to 0 over 50 000 steps. We use a fixed peak learning rate of 4 × 10^-4. (See the schedule sketch after the table.)
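The Figure 1 caption quoted in the Pseudocode row maps directly onto a parameter-copying routine: copy every shared weight unchanged, tile the replaced MLP weights across all experts, and initialize the router from scratch. The sketch below illustrates this under stated assumptions; it is not the authors' released implementation (that lives in the vmoe and t5x repositories), and the flat parameter-dict layout, the 'mlp'/'router' naming, and the upcycle_params helper are all hypothetical.

# Minimal upcycling-initialization sketch (illustrative, not the released code).
import numpy as np


def upcycle_params(dense_params, num_experts, d_model, rng):
    """Build MoE parameters from a dense checkpoint stored as a flat dict."""
    moe_params = {}
    for name, value in dense_params.items():
        if "mlp" in name:
            # Each expert starts as an identical copy of the replaced dense MLP layer.
            moe_params[name] = np.stack([value] * num_experts, axis=0)
        else:
            # Attention, embedding, and layer-norm parameters are copied verbatim.
            moe_params[name] = value
    # The router has no dense counterpart, so it is initialized fresh.
    moe_params["router/kernel"] = rng.normal(scale=d_model ** -0.5,
                                             size=(d_model, num_experts))
    return moe_params


# Example: upcycle a toy dense layer into a 4-expert MoE layer.
rng = np.random.default_rng(0)
dense = {
    "layer_0/attention/kernel": rng.normal(size=(8, 8)),
    "layer_0/mlp/wi": rng.normal(size=(8, 32)),
    "layer_0/mlp/wo": rng.normal(size=(32, 8)),
}
moe = upcycle_params(dense, num_experts=4, d_model=8, rng=rng)
print(moe["layer_0/mlp/wi"].shape)  # (4, 8, 32): one identical copy per expert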
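The learning rate schedule quoted in the Experiment Setup row (10 000-step warmup, reverse square root decay with a 100 000-step timescale, linear cooldown to 0 over the final 50 000 steps, fixed peak of 4 × 10^-4) can be written as a single step-to-rate function. The sketch below assumes a linear warmup and that the cooldown anneals the already-decayed value; those details, and the learning_rate helper itself, are assumptions rather than the released t5x/vmoe configuration.

# Hedged sketch of the stated learning rate schedule (assumptions noted above).
def learning_rate(step, total_steps, peak=4e-4,
                  warmup=10_000, timescale=100_000, cooldown=50_000):
    if step < warmup:
        # Assumed linear warmup from 0 to the peak learning rate.
        return peak * step / warmup
    # Reverse square root decay relative to the timescale: constant at the peak
    # until `timescale` steps, then peak * sqrt(timescale / step).
    lr = peak * (timescale / max(step, timescale)) ** 0.5
    if step > total_steps - cooldown:
        # Final linear cooldown of the decayed value to 0.
        lr *= (total_steps - step) / cooldown
    return max(lr, 0.0)


# Example: sample the schedule at a few points of a hypothetical 400 000-step run.
for s in (5_000, 10_000, 100_000, 200_000, 399_000):
    print(s, f"{learning_rate(s, total_steps=400_000):.2e}")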