Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Mixture-of-Experts with Expert Choice Routing
Authors: Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. |
| Researcher Affiliation | Industry | Google, Mountain View, CA, USA EMAIL |
| Pseudocode | No | The paper does not include a distinct pseudocode block or algorithm section. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. The checklist mentions 'We include details in the experiment setup to help reproduce the main results', which refers to descriptive details rather than open-source code availability. |
| Open Datasets | Yes | Dataset: We use the high-quality dataset from GLaM [7] of 1.6 trillion tokens that are representative of a wide range of natural language use cases. |
| Dataset Splits | Yes | Model Evaluation: We mainly focus on evaluating the finetuning performance on the 11 selected tasks from GLUE and SuperGLUE benchmarks [34, 35]. |
| Hardware Specification | Yes | The largest model (8B/64E) is trained on 512 TPU V4 chips. |
| Software Dependencies | No | The paper mentions using an 'Adafactor optimizer' and 'SentencePiece subword tokenizer' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Our model training follows the setups of GLaM [7] where a maximum sequence length of 1024 tokens is adopted. We use an Adafactor optimizer [32] with first-moment decay β1 = 0 and second-moment decay β2 = 0.99. We keep the learning rate constant for the first 10K training steps, and then decay it with an inverse square root schedule. ... We use a dropout rate of 0 during training... |
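
The learning-rate schedule quoted in the Experiment Setup row (constant for the first 10K steps, then inverse square root decay) can be sketched as below. This is an illustrative reading of that one sentence, not the authors' code: the function name `inverse_sqrt_lr`, the parameter `base_lr`, and the choice to keep the curve continuous at the 10K-step boundary are assumptions. The excerpt above does not state the base learning rate, so the usage lines use a placeholder of 1.0 and show only the relative multiplier.

```python
import math


def inverse_sqrt_lr(step: int, base_lr: float, constant_steps: int = 10_000) -> float:
    """Learning rate at a given training step (hypothetical helper).

    Holds base_lr for the first `constant_steps` steps, then decays it
    proportionally to 1 / sqrt(step). Scaling by sqrt(constant_steps) is an
    assumption that keeps the schedule continuous at the boundary.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if constant_steps <= 0:
        raise ValueError("constant_steps must be positive")
    # max() clamps early steps, so the ratio is 1 during the constant phase.
    return base_lr * math.sqrt(constant_steps / max(step, constant_steps))


if __name__ == "__main__":
    # Placeholder base_lr of 1.0: the printed values are multipliers only.
    for step in (0, 5_000, 10_000, 40_000, 160_000):
        print(step, inverse_sqrt_lr(step, base_lr=1.0))
```

Under these assumptions the rate halves each time the step count quadruples past the constant phase, which is the usual behavior of an inverse square root schedule.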