Mixture-of-Experts with Expert Choice Routing
Authors: Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. |
| Researcher Affiliation | Industry | Google, Mountain View, CA, USA {yanqiz, taole, hanxiaol, dunan, huangyp, vzhao, adai, zhifengc, qvl, jlaudon}@google.com |
| Pseudocode | No | The paper does not include a distinct pseudocode block or algorithm section. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. The checklist mentions 'We include details in the experiment setup to help reproduce the main results', which refers to descriptive details rather than open-source code availability. |
| Open Datasets | Yes | Dataset: We use the high-quality dataset from GLaM [7] of 1.6 trillion tokens that are representative of a wide range of natural language use cases. |
| Dataset Splits | Yes | Model Evaluation: We mainly focus on evaluating the finetuning performance on the 11 selected tasks from the GLUE and SuperGLUE benchmarks [34, 35]. |
| Hardware Specification | Yes | The largest model (8B/64E) is trained on 512 TPU V4 chips. |
| Software Dependencies | No | The paper mentions using an 'Adafactor optimizer' and 'SentencePiece subword tokenizer' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Our model training follows the setups of GLaM [7] where a maximum sequence length of 1024 tokens is adopted. We use an Adafactor optimizer [32] with first-moment decay β1 = 0 and second-moment decay β2 = 0.99. We keep the learning rate constant for the first 10K training steps, and then decay it with an inverse square root schedule. ... We use a dropout rate of 0 during training... |
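
The Experiment Setup row quotes a learning rate that is held constant for the first 10K steps and then decayed with an inverse square root schedule. A minimal sketch of that schedule is below; the peak learning rate value is an assumed placeholder, since the quoted setup does not state it.

```python
def learning_rate(step: int, peak_lr: float = 0.01, constant_steps: int = 10_000) -> float:
    """Constant LR for the first `constant_steps` steps, then inverse-square-root decay.

    `peak_lr` is a placeholder value for illustration; the quoted experiment
    setup does not specify the actual peak learning rate.
    """
    if step <= constant_steps:
        return peak_lr
    return peak_lr * (constant_steps / step) ** 0.5
```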
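
For context on the routing scheme named in the title, the following is a minimal, illustrative sketch of expert-choice routing: instead of each token picking its top-1 (Switch Transformer) or top-2 (GShard) experts, each expert selects its top-k tokens by router score, with k set by the number of tokens, the capacity factor, and the number of experts. The function name, variable names, and toy dimensions are assumptions for illustration, not the paper's code.

```python
import numpy as np

def expert_choice_routing(x, w_gate, capacity_factor=2.0):
    """Illustrative sketch of expert-choice routing (not the authors' implementation).

    x:      [n, d] token representations
    w_gate: [d, e] router weights
    Returns per-expert gating weights, selected token indices, and a dispatch mask.
    """
    n, _ = x.shape
    e = w_gate.shape[1]
    # Each expert takes a fixed budget of k tokens.
    k = int(n * capacity_factor / e)

    # Token-to-expert affinity scores, normalized over experts.
    logits = x @ w_gate                                       # [n, e]
    scores = np.exp(logits - logits.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)

    # Each expert (column of the score matrix) picks its k highest-scoring tokens.
    topk_idx = np.argsort(-scores.T, axis=-1)[:, :k]          # [e, k] token indices
    gates = np.take_along_axis(scores.T, topk_idx, axis=-1)   # [e, k] gating weights

    # One-hot dispatch mask [e, k, n] for gathering tokens to experts.
    dispatch = np.zeros((e, k, n))
    for expert in range(e):
        dispatch[expert, np.arange(k), topk_idx[expert]] = 1.0
    return gates, topk_idx, dispatch

# Toy usage: 8 tokens, hidden size 4, 4 experts.
rng = np.random.default_rng(0)
gates, idx, dispatch = expert_choice_routing(rng.normal(size=(8, 4)),
                                             rng.normal(size=(4, 4)))
```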