Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts

Authors: Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, Tianlong Chen

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate Flex-MoE on the ADNI dataset, which encompasses four modalities in the Alzheimer's Disease domain, as well as on the MIMIC-IV dataset. The results demonstrate the effectiveness of Flex-MoE, highlighting its ability to model arbitrary modality combinations in diverse missing modality scenarios.
Researcher Affiliation Academia University of North Carolina at Chapel Hill; University of Pennsylvania; University of Science and Technology of China
Pseudocode No The paper describes the algorithmic steps and uses mathematical equations but does not provide a clearly labeled pseudocode or algorithm block.
Open Source Code Yes Code is available at: https://github.com/UNITES-Lab/flex-moe.
Open Datasets Yes ADNI Dataset: the Alzheimer's Disease Neuroimaging Initiative (ADNI) is a landmark multimodal AD dataset that tracks disease progression and pathological changes, comprising comprehensive imaging, genetic, clinical, and biospecimen data ([64], [67])... ADNI has established standardized multi-center protocols and provides open access to qualified researchers, making it a gold-standard resource in the field ([65], [66]).
Dataset Splits Yes For the dataset split, we chose 70% for training, with the remaining 30% split evenly between validation and test sets (15% each).
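The 70/15/15 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the paper does not specify the shuffling or seeding procedure, so the RNG seed and permutation step here are assumptions.

```python
import numpy as np

def split_indices(n, seed=0):
    """Shuffle n sample indices and split 70/15/15 into train/val/test.

    Sketch of the split reported in the paper; seeding and shuffling
    are assumptions, since the paper does not describe them.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.70 * n)   # 70% for training
    n_val = int(0.15 * n)     # remaining 30% split evenly (15% each)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test
```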
Hardware Specification Yes All experiments were conducted using NVIDIA A100 GPUs.
Software Dependencies No The paper does not provide specific version numbers for software libraries or frameworks used in the experiments (e.g., PyTorch, TensorFlow, scikit-learn).
Experiment Setup Yes To ensure a fair comparison with other baselines, we used the best hyperparameter settings provided in the original papers. If not available, we tuned the learning rate in {1e-3, 1e-4, 1e-5}, the hidden dimension in {64, 128, 256}, and the batch size in {8, 16}. For our proposed method, we searched the number of experts in {16, 32} and Top-k in {2, 3, 4}. We set the coefficient of the sum of additional losses (importance and load balancing) combined with our cross-entropy loss to 0.01, scaling it within the task classification loss. For the dataset split, we chose 70% for training, with the remaining 30% split evenly between validation and test sets (15% each). In Appendix A.2, Table 6 further details the hyperparameter setup for the ADNI and MIMIC-IV datasets, including Learning rate, # of Experts, # of SMoE layers, Top-K, Training Epochs, Warm-up Epochs, Hidden dimension, Batch Size, and # of Attention Heads.
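The search space and loss weighting quoted above can be sketched as below. The grid values and the 0.01 coefficient come from the paper's text; the enumeration helper and the exact form in which the auxiliary losses are summed are illustrative assumptions, not the authors' implementation.

```python
import itertools

# Hyperparameter grid quoted in the paper; the search procedure
# (exhaustive enumeration here) is an assumption.
GRID = {
    "lr": [1e-3, 1e-4, 1e-5],
    "hidden_dim": [64, 128, 256],
    "batch_size": [8, 16],
    "num_experts": [16, 32],
    "top_k": [2, 3, 4],
}

AUX_COEF = 0.01  # coefficient on importance + load-balancing losses

def total_loss(ce_loss, importance_loss, load_balance_loss, coef=AUX_COEF):
    """Cross-entropy plus the weighted sum of the auxiliary SMoE losses."""
    return ce_loss + coef * (importance_loss + load_balance_loss)

def grid_configs(grid=GRID):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```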