Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Authors: Tianlong Chen, Zhenyu Zhang, Ajay Kumar Jaiswal, Shiwei Liu, Zhangyang Wang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments across diverse transformer architectures on a variety of tasks demonstrate the superior performance and substantial computation savings of SMoE-Dropout
Researcher Affiliation | Academia | VITA Group, University of Texas at Austin EMAIL
Pseudocode | Yes | Algorithm 1: Concrete Dropout in a PyTorch-like style
Open Source Code | Yes | Codes and models are available in https://github.com/VITA-Group/Random-MoE-as-Dropout.
Open Datasets | Yes | Transformer-XL is pre-trained on enwik8 (Mahoney, 2011) dataset, while we use BooksCorpus (Zhu et al., 2015) for BERT and RoBERTa.
Dataset Splits | No | The paper mentions evaluating on 'the hold-out validation set' but does not specify its size, percentage, or how it was split from the main dataset.
Hardware Specification | Yes | {1 RTX A6000, batch size 22} and {8 V100, batch size 64} are adopted for time measurements of Transformer-XL and BERT/RoBERTa, respectively.
Software Dependencies | No | The paper references Hugging Face and provides PyTorch-like pseudocode, but does not specify version numbers for these software components or any other libraries.
Experiment Setup | Yes | For Transformer-XL, we follow the official training setups, using Adam optimizer and the learning rate starts from 2.5 × 10^-4 and decreases according to a cosine annealing scheduler. We use a batch size of 22 and optimize the network for 4 × 10^5 iterations. As for BERT pre-training, we adopt an AdamW optimizer with an initial learning rate of 5 × 10^-5 that linearly decays to 0. The batch size and total training steps are 64 and 1 × 10^5, respectively.
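
The two learning-rate schedules named in the Experiment Setup row can be illustrated with a short, dependency-free sketch. It is not taken from the paper's code. The function names, the per-step granularity, the absence of warmup, and the cosine floor of 0 are assumptions made for illustration; only the initial learning rates (2.5 × 10^-4 and 5 × 10^-5), the step counts (4 × 10^5 and 1 × 10^5), and the batch sizes (22 and 64) come from the quoted setup.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int, base_lr: float,
                        min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate from base_lr down to min_lr.

    min_lr = 0.0 is an assumption; the quoted setup does not state a floor.
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def linear_decay_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Learning rate that decays linearly from base_lr to 0 (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress)


# Hyperparameters quoted in the Experiment Setup row (dictionary keys are hypothetical).
TRANSFORMER_XL = {"base_lr": 2.5e-4, "total_steps": 400_000, "batch_size": 22}
BERT = {"base_lr": 5e-5, "total_steps": 100_000, "batch_size": 64}

if __name__ == "__main__":
    # Print each schedule at the start, midpoint, and end of training.
    for frac in (0.0, 0.5, 1.0):
        txl_step = int(frac * TRANSFORMER_XL["total_steps"])
        bert_step = int(frac * BERT["total_steps"])
        txl_lr = cosine_annealing_lr(txl_step, TRANSFORMER_XL["total_steps"],
                                     TRANSFORMER_XL["base_lr"])
        bert_lr = linear_decay_lr(bert_step, BERT["total_steps"], BERT["base_lr"])
        print(f"{frac:.0%}: Transformer-XL lr={txl_lr:.3e}, BERT lr={bert_lr:.3e}")
```

Both schedules start at their stated initial rate and reach 0 at the final step under these assumptions. They differ in shape: the cosine curve stays near the initial rate early on and flattens again near the end, while the linear schedule drops at a constant rate throughout.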