Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
Authors: Tianlong Chen, Zhenyu Zhang, Ajay Kumar Jaiswal, Shiwei Liu, Zhangyang Wang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments across diverse transformer architectures on a variety of tasks demonstrate the superior performance and substantial computation savings of SMoE-Dropout |
| Researcher Affiliation | Academia | 1VITA Group, University of Texas at Austin {tianlong.chen,zhenyu.zhang,ajayjaiswal,shiwei.liu,atlaswang}@utexas.edu |
| Pseudocode | Yes | Algorithm 1: Concrete Dropout in a PyTorch-like style |
| Open Source Code | Yes | Codes and models are available in https://github.com/VITA-Group/Random-MoE-as-Dropout. |
| Open Datasets | Yes | Transformer-XL is pre-trained on enwik8 (Mahoney, 2011) dataset, while we use BookCorpus (Zhu et al., 2015) for BERT and RoBERTa. |
| Dataset Splits | No | The paper mentions evaluating on 'the hold-out validation set' but does not specify its size, percentage, or how it was split from the main dataset. |
| Hardware Specification | Yes | {1 RTX A6000, batch size 22} and {8 V100, batch size 64} are adopted for time measurements of Transformer-XL and BERT/RoBERTa, respectively. |
| Software Dependencies | No | The paper references Hugging Face and provides PyTorch-like pseudocode, but does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | For Transformer-XL, we follow the official training setups, using Adam optimizer and the learning rate starts from 2.5 × 10⁻⁴ and decreases according to a cosine annealing scheduler. We use a batch size of 22 and optimize the network for 4 × 10⁵ iterations. As for BERT pre-training, we adopt an AdamW optimizer with an initial learning rate of 5 × 10⁻⁵ that linearly decays to 0. The batch size and total training steps are 64 and 1 × 10⁵, respectively. |
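
The Pseudocode row notes that the paper provides Algorithm 1 in a PyTorch-like style. As a rough illustration of the idea in the title (using sparse-MoE routing as a form of dropout), the following is a minimal sketch assuming a frozen, randomly initialized router that dispatches each token to its top-k experts. The class name, expert shapes, and gating details are illustrative assumptions, not the authors' Algorithm 1.

```python
import torch
import torch.nn as nn

class RandomRoutedMoE(nn.Module):
    """Illustrative sketch only: a frozen, randomly initialized router that
    sends each token to its top-k experts. Names and details are assumptions,
    not the paper's exact algorithm."""

    def __init__(self, d_model: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )
        # Router weights are fixed at initialization (no gradient updates).
        self.router = nn.Linear(d_model, num_experts, bias=False)
        for p in self.router.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.router(x)                           # (batch, seq, E)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1) # (batch, seq, k)
        gates = torch.softmax(topk_vals, dim=-1)          # renormalize over top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[..., slot]                     # (batch, seq)
            gate = gates[..., slot].unsqueeze(-1)         # (batch, seq, 1)
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)           # tokens routed to expert e
                out = out + mask * gate * expert(x)
        return out
```

A growing number of active experts over training (the "self-slimmable" aspect referenced in the title) could be emulated by increasing `self.k` between epochs; that scheduling is omitted here.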
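
The Experiment Setup row lists the optimizers and learning-rate schedules. Below is a minimal PyTorch sketch of those settings; the placeholder model and the specific scheduler classes are assumptions, since the actual training code lives in the linked repository.

```python
import torch
from torch.optim import Adam, AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR

# Placeholder module; the real Transformer-XL / BERT models come from the
# authors' repository and are not reproduced here.
model = torch.nn.Linear(512, 512)

# Transformer-XL pre-training (per the table): Adam, lr 2.5e-4 with cosine
# annealing, 4e5 iterations, batch size 22.
txl_steps = 400_000
txl_opt = Adam(model.parameters(), lr=2.5e-4)
txl_sched = CosineAnnealingLR(txl_opt, T_max=txl_steps)

# BERT pre-training (per the table): AdamW, lr 5e-5 decaying linearly to 0
# over 1e5 steps, batch size 64.
bert_steps = 100_000
bert_opt = AdamW(model.parameters(), lr=5e-5)
bert_sched = LambdaLR(bert_opt, lr_lambda=lambda step: max(0.0, 1 - step / bert_steps))
```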