CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

Authors: Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train the CuMo models on a mixture of open-sourced datasets, which are converted into the visual instruction tuning format. Then, we conduct comprehensive evaluations of the performance of CuMo models across various competitive VQA-based and instruction-following-based benchmarks. Additionally, we perform ablation studies on each module with upcycled MoE blocks, along with qualitative analysis of the results.
Researcher Affiliation | Collaboration | Jiachen Li¹*, Xinyao Wang², Sijie Zhu², Chia-Wen Kuo², Lu Xu², Fan Chen², Jitesh Jain¹, Humphrey Shi¹, Longyin Wen²; ¹SHI Labs @ Georgia Tech & UIUC, ²ByteDance Inc., San Jose
Pseudocode | No | The paper describes the architecture and training process in detail but does not include structured pseudocode or algorithm blocks (an illustrative sketch of an upcycled MoE block is given after this table).
Open Source Code | Yes | https://github.com/SHI-Labs/CuMo
Open Datasets | Yes | Our models are trained fully on open-sourced datasets that are converted to visual instruction following formats. The total data size for visual instruction tuning is approximately 1.65 million, and all training data are publicly accessible.
Dataset Splits | No | The paper lists the datasets used for training and the benchmarks used for evaluation, but it does not provide explicit training/validation/test splits (percentages or counts) for its training mixture; instead, evaluation relies on pre-defined external benchmarks.
Hardware Specification | Yes | GPUs: 8 A100 (PT), 16 A100 (PFT), 32 A100 (VIT) (Table 6). The final model reported in Table 1 was trained on 32 A100 GPUs; batch sizes and learning rates are listed under Experiment Setup below.
Software Dependencies | No | The paper does not list specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA, specific library versions).
Experiment Setup | Yes | Table 6 provides an overview of the main hyperparameters used during the three-stage training process. For the final results presented in Table 1, the model was trained using 32 A100 GPUs with a total batch size of 256 and a learning rate of 4e-6. All ablation studies were conducted with a total batch size of 128 and learning rates of 2e-5 and 2e-6, as detailed in Section 4.3. (Section 4.1 and Table 6; a hedged configuration sketch is given after this table.)
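
To make the co-upcycled MoE blocks mentioned above more concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it shows a Top-K sparsely-gated MoE MLP whose experts are initialized ("upcycled") from a pre-trained dense MLP, the general idea named in the paper's title. All class, function, and variable names are illustrative assumptions rather than names from the CuMo codebase.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoEMLP(nn.Module):
    """Top-K sparsely-gated MoE whose experts start as copies of a dense MLP (sketch only)."""
    def __init__(self, dense_mlp, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        # Upcycling: every expert is initialized from the pre-trained dense MLP weights.
        self.experts = nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])
        self.router = nn.Linear(hidden_dim, num_experts)  # token-level gating network
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        gate = F.softmax(self.router(x), dim=-1)
        weights, indices = torch.topk(gate, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: upcycle a simple pre-trained two-layer MLP connector.
dense = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
moe = UpcycledMoEMLP(dense, hidden_dim=1024, num_experts=4, top_k=2)
print(moe(torch.randn(8, 1024)).shape)  # torch.Size([8, 1024])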
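
The stage-wise settings quoted under Hardware Specification and Experiment Setup can also be collected into a small configuration sketch. This is only a restatement of the reported numbers in Python form; the stage labels PT, PFT, and VIT come from Table 6, and any hyperparameter not quoted above is omitted rather than guessed.

# Settings as reported above (Table 6 / Section 4.1); fields the report does
# not quote are intentionally left out, not inferred.
FINAL_RUN = {
    "PT":  {"gpus": "8x A100"},
    "PFT": {"gpus": "16x A100"},
    "VIT": {"gpus": "32x A100", "global_batch_size": 256, "learning_rate": 4e-6},
}
ABLATIONS = {"global_batch_size": 128, "learning_rates": (2e-5, 2e-6)}

if __name__ == "__main__":
    for stage, cfg in FINAL_RUN.items():
        print(stage, cfg)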