MoEC: Mixture of Expert Clusters
Authors: Yuan Xie, Shaohan Huang, Tianyu Chen, Furu Wei
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that MoEC could improve performance on machine translation and natural language understanding tasks. MoEC plays a positive role in mitigating overfitting and sparse data allocation problems, thus fully releasing the potential of large-scale sparse models. |
| Researcher Affiliation | Industry | Yuan Xie*, Shaohan Huang, Tianyu Chen, Furu Wei, Microsoft Research Asia, China {v-yuanxie, shaohanh, v-tianyuchen, fuwei}@microsoft.com |
| Pseudocode | No | The paper includes mathematical equations and descriptions of the model, but it does not provide pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code of this paper could be obtained from https://github.com/xy980523/MoEcmodel. |
| Open Datasets | Yes | WMT 2014 English-to-German. The Ninth Workshop on Statistical Machine Translation (WMT 2014) releases a collection of datasets used in shared tasks including machine translation. We add additional news-commentary-v12 data from WMT-17 for training and validation. GLUE. The General Language Understanding Evaluation (Wang et al. 2018) benchmark is a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks, including MNLI (Williams, Nangia, and Bowman 2017), CoLA (Warstadt, Singh, and Bowman 2019), SST-2 (Socher et al. 2013), QQP, QNLI (Rajpurkar et al. 2016), MRPC (Dolan and Brockett 2005) and STS-B (Cer et al. 2017). We will pre-train our model on the Books Corpus (Zhu et al. 2015) and English Wikipedia corpus for 120k steps before fine-tuning on GLUE tasks. |
| Dataset Splits | Yes | We add additional news-commentary-v12 data from WMT-17 for training and validation. WMT-14 is measured on the test set, while GLUE tasks are measured on the development sets. Batch size, training steps, and dropout rate are set by different tasks, which are recorded in Appendix C. Table 7 presents the training hyper-parameters for WMT-14 and pre-training. Table 8 presents the training hyper-parameters on downstream GLUE tasks. |
| Hardware Specification | No | The paper mentions 'hardware capacity' generally but does not provide specific details on the hardware (e.g., GPU models, CPU types) used for experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For our MoEC and all baseline models, we follow the recommended settings in Vaswani et al. (2017) and use Transformer-big as the unified backbone architecture on the WMT 2014 English-German translation task. For GLUE tasks, we use Transformer-base as the backbone architecture. For MoE layers, we apply the 64-expert MoE model with 3 FFN sub-layers in the 3rd encoder block and 3rd decoder block (same as the setting in Lewis et al. (2021)). More detailed model hyper-parameters can be found in Appendix B. For the clustering loss, we set β to 10^-2 according to the experiment results (see Appendix A) and set µ = 0 by default. For a fair comparison, the dense model, MoE baseline model, and MoEC model share the same training hyper-parameters. All models are trained with the Adam optimizer (Kingma and Ba 2014) (β1 = 0.9, β2 = 0.98). The learning rate is set to 5e-4 with 4000 warm-up steps and an inverse square root scheduler (Raffel et al. 2019); see the sketch after the table. Batch size, training steps, and dropout rate are set by different tasks, which are recorded in Appendix C. |
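The Experiment Setup row fully specifies the optimizer and learning-rate schedule, so that part of the configuration can be reproduced directly. Below is a minimal PyTorch sketch of those settings, not the authors' released code: the placeholder model and the `inverse_sqrt_lr` helper are assumptions, while the Adam betas (0.9, 0.98), the 5e-4 base learning rate, and the 4000 warm-up steps come from the table above.

```python
# Hedged sketch of the reported training setup: Adam(beta1=0.9, beta2=0.98),
# base lr 5e-4, 4000 warm-up steps, inverse square root decay afterwards.
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the Transformer backbone

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

warmup_steps = 4000

def inverse_sqrt_lr(step: int) -> float:
    """Multiplier on the base lr: linear warm-up, then decay ~ 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_lr)

# In a training loop, call optimizer.step() and then scheduler.step() once per update;
# batch size, total steps, and dropout are task-specific (Appendix C of the paper).
```

This reproduces only the optimizer and schedule; the MoEC-specific pieces (expert clustering, clustering loss with β = 10^-2) are described in the paper and its released code rather than in this sketch.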