Mixture-of-Experts with Expert Choice Routing

Authors: Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. (A hedged sketch of the expert-choice routing idea follows the table.)
Researcher Affiliation | Industry | Google, Mountain View, CA, USA. {yanqiz, taole, hanxiaol, dunan, huangyp, vzhao, adai, zhifengc, qvl, jlaudon}@google.com
Pseudocode | No | The paper does not include a distinct pseudocode block or algorithm section.
Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. The checklist mentions 'We include details in the experiment setup to help reproduce the main results', which refers to descriptive details rather than open-source code availability.
Open Datasets | Yes | Dataset: We use the high-quality dataset from GLaM [7] of 1.6 trillion tokens that are representative of a wide range of natural language use cases.
Dataset Splits | Yes | Model Evaluation: We mainly focus on evaluating the finetuning performance on the 11 selected tasks from the GLUE and SuperGLUE benchmarks [34, 35].
Hardware Specification | Yes | The largest model (8B/64E) is trained on 512 TPU V4 chips.
Software Dependencies | No | The paper mentions using an 'Adafactor optimizer' and a 'SentencePiece subword tokenizer' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Our model training follows the setups of GLaM [7] where a maximum sequence length of 1024 tokens is adopted. We use an Adafactor optimizer [32] with first-moment decay β1 = 0 and second-moment decay β2 = 0.99. We keep the learning rate constant for the first 10K training steps, and then decay it with an inverse square root schedule. ... We use a dropout rate of 0 during training... (A sketch of this learning-rate schedule follows the table.)
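The Research Type row describes expert choice routing, which reverses top-1/top-2 gating: instead of each token picking experts, each expert picks a fixed quota of tokens. Since the paper ships no pseudocode or code (see the Pseudocode and Open Source Code rows), the following is only a minimal NumPy sketch of that idea; the function name, tensor shapes, and the capacity-factor value are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of expert-choice routing: each EXPERT selects its
# top-k tokens from a softmax score matrix, rather than each token
# selecting experts as in Switch (top-1) or GShard (top-2) gating.
import numpy as np

def expert_choice_route(token_states, w_gate, capacity_factor=2.0):
    """token_states: [n, d]; w_gate: [d, e]. Returns per-expert token
    indices [e, k] and combine weights [e, k]."""
    n = token_states.shape[0]
    e = w_gate.shape[1]
    # Each expert processes k tokens, set by the capacity factor.
    k = int(n * capacity_factor / e)

    logits = token_states @ w_gate                        # [n, e]
    scores = np.exp(logits - logits.max(axis=1, keepdims=True))
    scores = scores / scores.sum(axis=1, keepdims=True)   # softmax over experts

    gating = scores.T                                     # [e, n]
    idx = np.argsort(-gating, axis=1)[:, :k]              # top-k tokens per expert
    gates = np.take_along_axis(gating, idx, axis=1)       # their combine weights
    return idx, gates

# Usage: route 16 tokens of width 8 across 4 experts (k = 8 per expert).
tokens = np.random.randn(16, 8)
w_gate = np.random.randn(8, 4)
idx, gates = expert_choice_route(tokens, w_gate)
print(idx.shape, gates.shape)  # (4, 8) (4, 8)
```

Because every expert fills exactly k slots, load is balanced by construction, which is the property the quoted speedup claim relies on.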
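The Experiment Setup row quotes a learning-rate schedule that stays constant for the first 10K steps and then decays with an inverse square root of the step count. A minimal sketch of such a schedule is below; the helper name and the peak rate of 0.01 are hypothetical, since the quoted text does not state the base learning rate.

```python
# Hypothetical helper for a constant-then-inverse-sqrt schedule.
def learning_rate(step, peak_lr=0.01, hold_steps=10_000):
    # Hold the rate constant for the first 10K steps, then decay ~ 1/sqrt(step).
    if step <= hold_steps:
        return peak_lr
    return peak_lr * (hold_steps / step) ** 0.5

print(learning_rate(5_000), learning_rate(40_000))  # 0.01 0.005
```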