Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Y Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. |
| Researcher Affiliation | Collaboration | 1Google 2University of California, Berkeley 3Massachusetts Institute of Technology 4University of Massachusetts Amherst 5The University of Texas at Austin |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a link to its source code or explicitly state that it is open-source. |
| Open Datasets | Yes | By default, all models are trained on the 1,836 finetuning tasks introduced by Chung et al. (2022). Specifically, Muffin comprises 80 tasks from Wei et al. (2022a) and 26 dialog/program synthesis tasks; T0-SF comprises 193 tasks from Sanh et al. (2022); NIV2 comprises 1554 tasks from Wang et al. (2022b); CoT comprises 9 reasoning tasks. |
| Dataset Splits | Yes | Please note, all MMLU findings presented in this paper correspond to the "validation" set. We employ the prompts in Chung et al. (2022). |
| Hardware Specification | Yes | "We've conducted a comparative analysis of disk memory, GPU memory and throughput under optimal batch sizes on 16 A100 DGX, using different engineering techniques and public libraries." and "Regarding training, we utilize 4x8x8 TPU Pods and internal infrastructure with carefully annotated tensor, model, and expert parallelism strategy." |
| Software Dependencies | No | The paper mentions using 'public libraries' and the 'Adafactor' optimizer but does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We adapt the sequence length of each FLAN-MOE to 2,048 for input and 512 for output based on the relative position embedding. The dropout rate is 0.05 and the expert dropout rate is 0.2. The learning rate is 1e-4 and the batch size is 32. The optimizer setting follows Chung et al. (2022), using Adafactor (Shazeer & Stern, 2018). (See the configuration sketch below.) |
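
For concreteness, the hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch. This is an illustrative summary only, not code from the paper: the class name `FlanMoEFinetuneConfig` and all field names are hypothetical, and the sketch assumes a standard encoder-decoder instruction-tuning setup as described by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlanMoEFinetuneConfig:
    """Hypothetical summary of the FLAN-MoE instruction-tuning
    hyperparameters reported in the paper (names are ours)."""
    input_seq_len: int = 2048        # input sequence length (relative position embeddings)
    output_seq_len: int = 512        # output sequence length
    dropout_rate: float = 0.05       # standard dropout
    expert_dropout_rate: float = 0.2 # dropout applied to MoE expert layers
    learning_rate: float = 1e-4
    batch_size: int = 32
    optimizer: str = "adafactor"     # Shazeer & Stern (2018), following Chung et al. (2022)


if __name__ == "__main__":
    # Print the reported settings in one place for quick reference.
    print(FlanMoEFinetuneConfig())
```

The dataclass is just a convenient container for the reported values; the actual training used the authors' internal T5-style infrastructure with tensor, model, and expert parallelism, which is not reproduced here.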