Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Y Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks.
Researcher Affiliation | Collaboration | ¹Google, ²University of California, Berkeley, ³Massachusetts Institute of Technology, ⁴University of Massachusetts Amherst, ⁵The University of Texas at Austin
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a link to its source code or explicitly state that it is open-source.
Open Datasets | Yes | By default, all models are trained on the 1,836 finetuning tasks introduced by Chung et al. (2022). Specifically, Muffin comprises 80 tasks from Wei et al. (2022a) and 26 dialog/program synthesis tasks; T0-SF comprises 193 tasks from Sanh et al. (2022); NIV2 comprises 1,554 tasks from Wang et al. (2022b); CoT comprises 9 reasoning tasks.
Dataset Splits | Yes | Please note, all MMLU findings presented in this paper correspond to the "validation" set. We employ the prompts in Chung et al. (2022).
Hardware Specification | Yes | "We've conducted a comparative analysis of disk memory, GPU memory and throughput under optimal batch sizes on 16 A100 DGX, using different engineering techniques and public libraries." and "Regarding training, we utilize 4x8x8 TPU Pods and internal infrastructure with carefully annotated tensor, model, and expert parallelism strategy."
Software Dependencies | No | The paper mentions using 'public libraries' and the 'Adafactor' optimizer but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We adapt the sequence length of each FLAN-MOE to 2,048 for input and 512 for output based on the relative position embedding. The dropout rate is 0.05 and the expert dropout rate is 0.2. The learning rate is 1e-4 and the batch size is 32. The optimizer setting follows Chung et al. (2022), using Adafactor (Shazeer & Stern, 2018).
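The experiment-setup entry above lists the reported finetuning hyperparameters. The snippet below is a minimal sketch that collects those values into a config and instantiates an Adafactor optimizer via optax; the field names, the use of optax, and the constant learning-rate schedule are assumptions, since the paper trains on internal infrastructure and does not release code.

```python
# Hypothetical config mirroring the reported FLAN-MoE finetuning settings.
# Field names and the use of optax are assumptions, not the authors' code.
import optax

finetune_config = {
    "input_seq_len": 2048,       # encoder input length
    "target_seq_len": 512,       # decoder output length
    "dropout_rate": 0.05,
    "expert_dropout_rate": 0.2,  # dropout applied inside MoE experts
    "learning_rate": 1e-4,
    "batch_size": 32,
}

# Adafactor, following Shazeer & Stern (2018); the schedule is only described
# as following Chung et al. (2022), so a constant rate is assumed here.
optimizer = optax.adafactor(learning_rate=finetune_config["learning_rate"])
```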
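The hardware entry mentions 4x8x8 TPU Pods with a tensor, model, and expert parallelism strategy. The sketch below shows one hypothetical way to express such a layout as a named JAX device mesh; the axis names, the 256-chip assumption, and the example sharding spec are illustrative and are not the paper's internal setup.

```python
# Hypothetical 4x8x8 device mesh with data / expert / model (tensor) axes.
# Requires 256 accelerators; axis names and the sharding are illustrative only.
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

devices = mesh_utils.create_device_mesh((4, 8, 8))
mesh = Mesh(devices, axis_names=("data", "expert", "model"))

# Example: shard an MoE expert weight of shape [num_experts, d_model, d_ff]
# over the expert and model axes, replicating it across the data axis.
expert_weight_sharding = NamedSharding(mesh, PartitionSpec("expert", None, "model"))
```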