Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Authors: Tianlong Chen, Zhenyu Zhang, Ajay Kumar Jaiswal, Shiwei Liu, Zhangyang Wang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments across diverse transformer architectures on a variety of tasks demonstrate the superior performance and substantial computation savings of SMoE-Dropout.
Researcher Affiliation | Academia | VITA Group, University of Texas at Austin; {tianlong.chen,zhenyu.zhang,ajayjaiswal,shiwei.liu,atlaswang}@utexas.edu
Pseudocode | Yes | Algorithm 1: Concrete Dropout in a PyTorch-like style
Open Source Code | Yes | Codes and models are available in https://github.com/VITA-Group/Random-MoE-as-Dropout.
Open Datasets | Yes | Transformer-XL is pre-trained on the enwik8 (Mahoney, 2011) dataset, while we use BooksCorpus (Zhu et al., 2015) for BERT and RoBERTa.
Dataset Splits | No | The paper mentions evaluating on "the hold-out validation set" but does not specify its size, percentage, or how it was split from the main dataset.
Hardware Specification | Yes | {1 RTX A6000, batch size 22} and {8 V100, batch size 64} are adopted for time measurements of Transformer-XL and BERT/RoBERTa, respectively.
Software Dependencies | No | The paper references Hugging Face and provides PyTorch-like pseudocode, but does not specify version numbers for these software components or any other libraries.
Experiment Setup | Yes | For Transformer-XL, we follow the official training setups, using the Adam optimizer; the learning rate starts from 2.5 × 10⁻⁴ and decreases according to a cosine annealing scheduler. We use a batch size of 22 and optimize the network for 4 × 10⁵ iterations. As for BERT pre-training, we adopt an AdamW optimizer with an initial learning rate of 5 × 10⁻⁵ that linearly decays to 0. The batch size and total training steps are 64 and 1 × 10⁵, respectively.
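The Experiment Setup row above quotes two training recipes. Below is a minimal PyTorch sketch of those optimizer and learning-rate-schedule settings only; the model objects are stand-ins and all variable names are placeholders rather than the authors' code.

```python
import torch
import torch.nn as nn

# Stand-in modules; only the optimizer/scheduler hyperparameters below
# are taken from the quoted Experiment Setup row.
transformer_xl = nn.Linear(512, 512)
bert = nn.Linear(768, 768)

# Transformer-XL: Adam, learning rate starting at 2.5e-4 with cosine
# annealing, batch size 22, 4 * 10^5 iterations.
xl_optimizer = torch.optim.Adam(transformer_xl.parameters(), lr=2.5e-4)
xl_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(xl_optimizer, T_max=400_000)

# BERT pre-training: AdamW, learning rate 5e-5 decayed linearly to 0,
# batch size 64, 1 * 10^5 steps.
bert_optimizer = torch.optim.AdamW(bert.parameters(), lr=5e-5)
bert_scheduler = torch.optim.lr_scheduler.LinearLR(
    bert_optimizer, start_factor=1.0, end_factor=0.0, total_iters=100_000
)

# Each scheduler is stepped once per optimizer update inside the training loop:
#   optimizer.step(); scheduler.step()
```

Dataloaders, loss functions, and the real model definitions are omitted; the batch sizes and step counts appear only in the comments because they belong to the training loop rather than the optimizer objects.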
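The Pseudocode row notes that the paper ships an algorithm in PyTorch-like style, and the released code is linked in the Open Source Code row. As a rough, unofficial illustration of the mechanism the title refers to, the sketch below shows a sparse MoE layer whose randomly initialized router is frozen, so that top-k expert selection acts as a structured, dropout-like mask over the network's capacity. Class and parameter names (RandomRouterMoE, num_experts, k) are ours, and this is not the paper's Algorithm 1; see https://github.com/VITA-Group/Random-MoE-as-Dropout for the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomRouterMoE(nn.Module):
    """Toy sparse MoE layer with a frozen, randomly initialized router."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.router.weight.requires_grad_(False)  # router stays random and fixed

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        # x: (tokens, d_model); each token activates only its k highest-scoring experts.
        scores = F.softmax(self.router(x), dim=-1)        # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(k, dim=-1)    # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(k):
            idx = topk_idx[:, slot]                       # expert id chosen for this slot
            weight = topk_scores[:, slot].unsqueeze(-1)   # its routing weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask] * expert(x[mask])
        return out


# Example: with a small k most experts are inactive for a given token, which is
# the dropout-like behavior; a larger k at inference time uses more capacity.
layer = RandomRouterMoE(d_model=64, d_hidden=128, num_experts=8)
tokens = torch.randn(16, 64)
print(layer(tokens, k=2).shape)  # torch.Size([16, 64])
```

Varying k trades compute for capacity, which is in the spirit of the "self-slimmable" behavior named in the title; the paper's actual routing and training schedule should be taken from its Algorithm 1 and the linked repository.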