Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
Authors: Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, Tianlong Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Flex-MoE on the ADNI dataset, which encompasses four modalities in the Alzheimer's Disease domain, as well as on the MIMIC-IV dataset. The results demonstrate the effectiveness of Flex-MoE, highlighting its ability to model arbitrary modality combinations in diverse missing modality scenarios. |
| Researcher Affiliation | Academia | ¹University of North Carolina at Chapel Hill, ²University of Pennsylvania, ³University of Science and Technology of China |
| Pseudocode | No | The paper describes the algorithmic steps and uses mathematical equations but does not provide a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at: https://github.com/UNITES-Lab/flex-moe. |
| Open Datasets | Yes | ADNI Dataset: the Alzheimer's Disease Neuroimaging Initiative (ADNI) is a landmark multimodal AD dataset that tracks disease progression and pathological changes, comprising comprehensive imaging, genetic, clinical, and biospecimen data ([64], [67])... ADNI has established standardized multi-center protocols and provides open access to qualified researchers, making it a gold-standard resource in the field ([65], [66]). |
| Dataset Splits | Yes | For the dataset split, we chose 70% for training, with the remaining 30% split evenly between validation and test sets (15% each). |
| Hardware Specification | Yes | All experiments were conducted using NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software libraries or frameworks used in the experiments (e.g., PyTorch, TensorFlow, scikit-learn). |
| Experiment Setup | Yes | To ensure a fair comparison with other baselines, we used the best hyperparameter settings provided in the original papers. If not available, we tuned the learning rate in {1e-3, 1e-4, 1e-5}, the hidden dimension in {64, 128, 256}, and the batch size in {8, 16}. For our proposed method, we searched the number of experts in {16, 32} and Top-K in {2, 3, 4}. We set the coefficient of the sum of the additional losses (importance and load balancing), combined with our cross-entropy loss, to 0.01, scaling it within the task classification loss. For the dataset split, we chose 70% for training, with the remaining 30% split evenly between validation and test sets (15% each). In Appendix A.2, Table 6 further details the hyperparameter setup for the ADNI and MIMIC-IV datasets, including learning rate, # of experts, # of SMoE layers, Top-K, training epochs, warm-up epochs, hidden dimension, batch size, and # of attention heads. |
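The reported experiment setup (70/15/15 split, the hyperparameter search space, and the 0.01-weighted auxiliary losses) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`split_indices`, `total_loss`) and the seeded shuffle are assumptions for the example.

```python
import random
from itertools import product

def split_indices(n, seed=0, train_frac=0.70, val_frac=0.15):
    """Shuffle n sample indices and split them 70/15/15 into
    train/validation/test, matching the reported dataset split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def total_loss(ce, importance, load_balance, coef=0.01):
    """Combine the task cross-entropy loss with the auxiliary
    importance and load-balancing losses, weighted by 0.01."""
    return ce + coef * (importance + load_balance)

# Search space reported in the paper (baseline tuning + Flex-MoE-specific knobs).
grid = {
    "lr": [1e-3, 1e-4, 1e-5],
    "hidden_dim": [64, 128, 256],
    "batch_size": [8, 16],
    "num_experts": [16, 32],  # Flex-MoE-specific
    "top_k": [2, 3, 4],       # Flex-MoE-specific
}
configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
```

Enumerating the full grid above yields 3 × 3 × 2 × 2 × 3 = 108 candidate configurations; in practice the paper selects per-dataset settings, detailed in its Appendix A.2, Table 6.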