Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Y Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks.
Researcher Affiliation | Collaboration | ¹Google, ²University of California, Berkeley, ³Massachusetts Institute of Technology, ⁴University of Massachusetts Amherst, ⁵The University of Texas at Austin
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a link to its source code or explicitly state that it is open-source.
Open Datasets | Yes | By default, all models are trained on the 1,836 finetuning tasks introduced by Chung et al. (2022). Specifically, Muffin comprises 80 tasks from Wei et al. (2022a) and 26 dialog/program synthesis tasks; T0-SF comprises 193 tasks from Sanh et al. (2022); NIV2 comprises 1,554 tasks from Wang et al. (2022b); CoT comprises 9 reasoning tasks.
Dataset Splits | Yes | Please note, all MMLU findings presented in this paper correspond to the "validation" set. We employ the prompts in Chung et al. (2022).
Hardware Specification | Yes | "We've conducted a comparative analysis of disk memory, GPU memory and throughput under optimal batch sizes on 16 A100 DGX, using different engineering techniques and public libraries." and "Regarding training, we utilize 4x8x8 TPU Pods and internal infrastructure with carefully annotated tensor, model, and expert parallelism strategy."
Software Dependencies | No | The paper mentions using 'public libraries' and the 'Adafactor' optimizer but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We adapt the sequence length of each FLAN-MOE to 2,048 for input and 512 for output based on the relative position embedding. The dropout rate is 0.05 and the expert dropout rate is 0.2. The learning rate is 1e-4 and the batch size is 32. The optimizer setting follows Chung et al. (2022), using Adafactor (Shazeer & Stern, 2018).
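The experiment-setup entry above lists the reported finetuning hyperparameters. The snippet below is a minimal sketch that collects those values into a config and instantiates an Adafactor optimizer via optax; the field names, the use of optax, and the constant learning-rate schedule are assumptions, since the paper trains on internal infrastructure and does not release code.

```python
# Hypothetical config mirroring the reported FLAN-MoE finetuning settings.
# Field names and the use of optax are assumptions, not the authors' code.
import optax

finetune_config = {
    "input_seq_len": 2048,       # encoder input length
    "target_seq_len": 512,       # decoder output length
    "dropout_rate": 0.05,
    "expert_dropout_rate": 0.2,  # dropout applied inside MoE experts
    "learning_rate": 1e-4,
    "batch_size": 32,
}

# Adafactor, following Shazeer & Stern (2018); the schedule is only described
# as following Chung et al. (2022), so a constant rate is assumed here.
optimizer = optax.adafactor(learning_rate=finetune_config["learning_rate"])
```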
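The hardware entry mentions 4x8x8 TPU Pods with a tensor, model, and expert parallelism strategy. The sketch below shows one hypothetical way to express such a layout as a named JAX device mesh; the axis names, the 256-chip assumption, and the example sharding spec are illustrative and are not the paper's internal setup.

```python
# Hypothetical 4x8x8 device mesh with data / expert / model (tensor) axes.
# Requires 256 accelerators; axis names and the sharding are illustrative only.
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

devices = mesh_utils.create_device_mesh((4, 8, 8))
mesh = Mesh(devices, axis_names=("data", "expert", "model"))

# Example: shard an MoE expert weight of shape [num_experts, d_model, d_ff]
# over the expert and model axes, replicating it across the data axis.
expert_weight_sharding = NamedSharding(mesh, PartitionSpec("expert", None, "model"))
```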