MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

Authors: Zhitian Xie, Yinger Zhang, Chenyi Zhuang, Qitao Shi, Zhining Liu, Jinjie Gu, Guannan Zhang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct plenty experiments including tabular, NLP and CV datasets, which shows MoDE's effectiveness, universality and robustness. Furthermore, we develop a parallel study through innovatively constructing expert probing, to experimentally prove why MoDE works: moderate distilling knowledge can improve each individual expert's test performances on their assigned tasks, leading to MoE's overall performance improvement. (A hedged sketch of expert probing appears after this table.)
Researcher Affiliation | Collaboration | ¹Ant Group, ²Zhejiang University
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link indicating the availability of source code for the described methodology.
Open Datasets | Yes | Tabular Datasets: 7 tabular benchmark data sets of classification task from the OpenML are used. Table 1 is the basic statistics of the data sets, where N, #Dim and #Classes are the number of samples, features and classes respectively. Natural Language Datasets: We evaluated our approach on the task of translation, which is widely recognized in the natural language processing. For the low-resource scenario, we used datasets from the IWSLT competitions, specifically the IWSLT14 English-German (En-De) and IWSLT17 English-Arabic (En-Ar) translations. ... For the rich-resource scenario, we used the WMT14 English-German dataset... Computer Vision Datasets: We apply a variety of datasets: both MNIST (LeCun et al. 1998) and Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017) consist of 60,000/10,000 examples of size 28x28 pixels for the training/test set... CIFAR10/100 (Krizhevsky, Hinton et al. 2009) has a training set with 50,000 images of size 32x32 pixels belonging to 10/100 classes.
Dataset Splits | Yes | For each tabular data set, we sample a random 60%, 20% and 20% of the samples as the training, validation and test set, respectively. (A split sketch appears after this table.)
Hardware Specification | Yes | All the experiments are conducted on NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions "Implementation is developed on Fairseq" but does not provide a specific version number for Fairseq or any other software dependencies like Python or PyTorch.
Experiment Setup | Yes | Settings: In this work, the number of experts N in all T-DMoE, N-DMoE and C-DMoE is set to 2, and the total number of experts N in T-SMoE is set to 10, while the number of activated experts K = 2. The distillation factor α is set to 0.01 or 0.1 in the tabular data sets, 1 in the NLP data sets and 10 in the CV data sets. (A hedged loss sketch using α appears after this table.)
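
For concreteness, here is a minimal sketch of the 60/20/20 random split quoted in the Dataset Splits row. The use of scikit-learn's train_test_split and the fixed seed are assumptions for illustration; the paper only states the split ratios.

# Hypothetical 60/20/20 split for one tabular dataset (ratios from the paper;
# library choice and seed value are assumptions).
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=0):
    # Hold out 20% of the samples as the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    # Take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20).
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)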
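
The Experiment Setup row fixes two experts and a per-modality distillation factor α, but the report does not quote the loss itself. The sketch below assumes a softmax-gated two-expert mixture trained with cross-entropy plus an α-weighted symmetric KL term between the experts' output distributions; the gating form and the KL formulation are illustrative assumptions, not the paper's exact objective.

# Hypothetical two-expert MoE with mutual distillation (PyTorch).
# The mixture architecture and the symmetric-KL distillation term are
# assumptions; only N = 2 and the factor alpha come from the quoted settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoExpertMoDE(nn.Module):
    def __init__(self, in_dim, num_classes, hidden=64):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                           nn.Linear(hidden, num_classes))
             for _ in range(2)])
        self.gate = nn.Linear(in_dim, 2)  # softmax gating over the 2 experts

    def forward(self, x):
        expert_logits = [e(x) for e in self.experts]   # per-expert logits
        weights = F.softmax(self.gate(x), dim=-1)      # (batch, 2)
        mixed = sum(w.unsqueeze(-1) * l
                    for w, l in zip(weights.unbind(-1), expert_logits))
        return mixed, expert_logits

def mode_loss(mixed, expert_logits, target, alpha=0.1):
    # Task loss on the gated mixture plus alpha-weighted mutual distillation.
    task = F.cross_entropy(mixed, target)
    log_p0 = F.log_softmax(expert_logits[0], dim=-1)
    log_p1 = F.log_softmax(expert_logits[1], dim=-1)
    # Each expert learns from the other's (detached) output distribution.
    distill = (F.kl_div(log_p0, log_p1.detach().exp(), reduction="batchmean")
               + F.kl_div(log_p1, log_p0.detach().exp(), reduction="batchmean"))
    return task + alpha * distill

With this shape, alpha maps directly to the quoted settings: 0.01 or 0.1 on the tabular data sets, 1 on the NLP data sets and 10 on the CV data sets.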
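
Finally, the Research Type row refers to "expert probing", i.e. measuring each individual expert's test performance on its assigned task. The assignment rule is not quoted here; the sketch below assumes a test sample belongs to the expert receiving the largest gate weight and reuses the TwoExpertMoDE sketch above, so both points are assumptions.

# Hypothetical expert probing: per-expert accuracy on the samples the gate
# routes to it (the argmax assignment rule is an assumption).
import torch
import torch.nn.functional as F

@torch.no_grad()
def probe_experts(model, loader, device="cpu"):
    correct, total = [0, 0], [0, 0]
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        _, expert_logits = model(x)                  # per-expert logits
        weights = F.softmax(model.gate(x), dim=-1)   # gate weights, (batch, 2)
        assigned = weights.argmax(dim=-1)            # expert index per sample
        for k in range(2):
            mask = assigned == k
            if mask.any():
                preds = expert_logits[k][mask].argmax(dim=-1)
                correct[k] += (preds == y[mask]).sum().item()
                total[k] += int(mask.sum())
    return [c / t if t else float("nan") for c, t in zip(correct, total)]

Comparing these per-expert accuracies with and without the distillation term is the kind of comparison the quoted claim ("moderate distilling knowledge can improve each individual expert's test performances") refers to.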