Mixture-of-Experts with Expert Choice Routing

Authors: Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. (A hedged sketch of the expert-choice routing idea follows the table.)
Researcher Affiliation | Industry | Google, Mountain View, CA, USA. {yanqiz, taole, hanxiaol, dunan, huangyp, vzhao, adai, zhifengc, qvl, jlaudon}@google.com
Pseudocode | No | The paper does not include a distinct pseudocode block or algorithm section.
Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. The checklist mentions 'We include details in the experiment setup to help reproduce the main results', which refers to descriptive details rather than open-source code availability.
Open Datasets | Yes | Dataset: We use the high-quality dataset from GLaM [7] of 1.6 trillion tokens that are representative of a wide range of natural language use cases.
Dataset Splits | Yes | Model Evaluation: We mainly focus on evaluating the finetuning performance on the 11 selected tasks from the GLUE and SuperGLUE benchmarks [34, 35].
Hardware Specification | Yes | The largest model (8B/64E) is trained on 512 TPU V4 chips.
Software Dependencies | No | The paper mentions using an 'Adafactor optimizer' and a 'SentencePiece subword tokenizer' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Our model training follows the setups of GLaM [7] where a maximum sequence length of 1024 tokens is adopted. We use an Adafactor optimizer [32] with first-moment decay β1 = 0 and second-moment decay β2 = 0.99. We keep the learning rate constant for the first 10K training steps, and then decay it with an inverse square root schedule. ... We use a dropout rate of 0 during training... (A sketch of this learning-rate schedule follows the table.)
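The Research Type row describes expert choice routing, which reverses top-1/top-2 gating: instead of each token picking experts, each expert picks a fixed quota of tokens. Since the paper ships no pseudocode or code (see the Pseudocode and Open Source Code rows), the following is only a minimal NumPy sketch of that idea; the function name, tensor shapes, and the capacity-factor value are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of expert-choice routing: each EXPERT selects its
# top-k tokens from a softmax score matrix, rather than each token
# selecting experts as in Switch (top-1) or GShard (top-2) gating.
import numpy as np

def expert_choice_route(token_states, w_gate, capacity_factor=2.0):
    """token_states: [n, d]; w_gate: [d, e]. Returns per-expert token
    indices [e, k] and combine weights [e, k]."""
    n = token_states.shape[0]
    e = w_gate.shape[1]
    # Each expert processes k tokens, set by the capacity factor.
    k = int(n * capacity_factor / e)

    logits = token_states @ w_gate                        # [n, e]
    scores = np.exp(logits - logits.max(axis=1, keepdims=True))
    scores = scores / scores.sum(axis=1, keepdims=True)   # softmax over experts

    gating = scores.T                                     # [e, n]
    idx = np.argsort(-gating, axis=1)[:, :k]              # top-k tokens per expert
    gates = np.take_along_axis(gating, idx, axis=1)       # their combine weights
    return idx, gates

# Usage: route 16 tokens of width 8 across 4 experts (k = 8 per expert).
tokens = np.random.randn(16, 8)
w_gate = np.random.randn(8, 4)
idx, gates = expert_choice_route(tokens, w_gate)
print(idx.shape, gates.shape)  # (4, 8) (4, 8)
```

Because every expert fills exactly k slots, load is balanced by construction, which is the property the quoted speedup claim relies on.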
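The Experiment Setup row quotes a learning-rate schedule that stays constant for the first 10K steps and then decays with an inverse square root of the step count. A minimal sketch of such a schedule is below; the helper name and the peak rate of 0.01 are hypothetical, since the quoted text does not state the base learning rate.

```python
# Hypothetical helper for a constant-then-inverse-sqrt schedule.
def learning_rate(step, peak_lr=0.01, hold_steps=10_000):
    # Hold the rate constant for the first 10K steps, then decay ~ 1/sqrt(step).
    if step <= hold_steps:
        return peak_lr
    return peak_lr * (hold_steps / step) ** 0.5

print(learning_rate(5_000), learning_rate(40_000))  # 0.01 0.005
```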