Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
Authors: Tianlong Chen, Zhenyu Zhang, Ajay Kumar Jaiswal, Shiwei Liu, Zhangyang Wang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments across diverse transformer architectures on a variety of tasks demonstrate the superior performance and substantial computation savings of SMoE-Dropout |
| Researcher Affiliation | Academia | VITA Group, University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1: Concrete Dropout in a PyTorch-like style |
| Open Source Code | Yes | Codes and models are available in https://github.com/VITA-Group/Random-MoE-as-Dropout. |
| Open Datasets | Yes | Transformer-XL is pre-trained on enwik8 (Mahoney, 2011) dataset, while we use BooksCorpus (Zhu et al., 2015) for BERT and RoBERTa. |
| Dataset Splits | No | The paper mentions evaluating on 'the hold-out validation set' but does not specify its size, percentage, or how it was split from the main dataset. |
| Hardware Specification | Yes | {1 RTX A6000, batch size 22} and {8 V100, batch size 64} are adopted for time measurements of Transformer-XL and BERT/RoBERTa, respectively. |
| Software Dependencies | No | The paper references Hugging Face and provides PyTorch-like pseudocode, but does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | For Transformer-XL, we follow the official training setups, using Adam optimizer and the learning rate starts from $2.5 \times 10^{-4}$ and decreases according to a cosine annealing scheduler. We use a batch size of 22 and optimize the network for $4 \times 10^{5}$ iterations. As for BERT pre-training, we adopt an AdamW optimizer with an initial learning rate of $5 \times 10^{-5}$ that linearly decays to 0. The batch size and total training steps are 64 and $1 \times 10^{5}$, respectively. |
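
The two learning-rate schedules quoted in the Experiment Setup row can be sketched in plain Python for readers who want to sanity-check the numbers. This is an illustrative sketch only, not the authors' training code: it reads the quoted values as 2.5e-4 over 400,000 iterations (Transformer-XL, cosine annealing) and 5e-5 over 100,000 steps (BERT, linear decay to 0). The function names, the absence of warmup, and the cosine floor of 0 (`min_lr`) are assumptions that the quoted text does not confirm.

```python
import math


def cosine_annealing_lr(step, base_lr=2.5e-4, total_steps=400_000, min_lr=0.0):
    """Cosine-annealed learning rate at a given step.

    Assumptions (not stated in the report): no warmup, and the schedule
    bottoms out at min_lr = 0 when step reaches total_steps.
    """
    # Clamp the step so values outside [0, total_steps] stay well-defined.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def linear_decay_lr(step, base_lr=5e-5, total_steps=100_000):
    """Learning rate that decays linearly from base_lr to 0.

    Assumption (not stated in the report): no warmup phase.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return base_lr * (1.0 - progress)


if __name__ == "__main__":
    # Print the rate at the start, midpoint, and end of each schedule.
    for step in (0, 200_000, 400_000):
        print("Transformer-XL", step, cosine_annealing_lr(step))
    for step in (0, 50_000, 100_000):
        print("BERT", step, linear_decay_lr(step))
```

In a PyTorch training loop, schedulers such as `torch.optim.lr_scheduler.CosineAnnealingLR` would normally play this role; the standalone functions above only make the shape of each schedule explicit.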