CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Authors: Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train the CuMo models on a mixture of open-sourced datasets converted into the visual instruction tuning format, then conduct comprehensive evaluations of the CuMo models across various competitive VQA-based and instruction-following-based benchmarks. Additionally, we perform ablation studies on each module with upcycled MoE blocks, together with qualitative analysis of the results (see the upcycling sketch after this table). |
| Researcher Affiliation | Collaboration | Jiachen Li¹*, Xinyao Wang², Sijie Zhu², Chia-Wen Kuo², Lu Xu², Fan Chen², Jitesh Jain¹, Humphrey Shi¹, Longyin Wen²; ¹SHI Labs @ Georgia Tech & UIUC, ²ByteDance Inc., San Jose |
| Pseudocode | No | The paper describes the architecture and training process in detail but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/SHI-Labs/CuMo |
| Open Datasets | Yes | Our models are trained entirely on open-sourced datasets converted to the visual instruction-following format. The total data size for visual instruction tuning is approximately 1.65 million samples, and all training data are publicly accessible. |
| Dataset Splits | No | The paper lists the datasets used for training and the benchmarks used for evaluation, but it does not provide explicit training/validation/test splits for its training mixture (e.g., 'X% for training, Y% for validation, Z% for testing'). Evaluation instead relies on pre-defined external benchmarks. |
| Hardware Specification | Yes | For the final results presented in Table 1, the model was trained using 32 A100 GPUs with a total batch size of 256 and a learning rate of 4e-6. All ablation studies were conducted with a total batch size of 128 and learning rates of 2e-5 and 2e-6, as detailed in Section 4.3. Per-stage GPUs (Table 6): 8× A100 for pre-training (PT), 16× A100 for pre-finetuning (PFT), 32× A100 for visual instruction tuning (VIT); see the config sketch after this table. |
| Software Dependencies | No | The paper does not list specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA, specific library versions). |
| Experiment Setup | Yes | Table 6 provides an overview of the main hyperparameters used during the three-stage training process. For the final results presented in Table 1, the model was trained using 32 A100 GPUs with a total batch size of 256 and a learning rate of 4e-6. All ablation studies were conducted with a total batch size of 128 and learning rates of 2e-5 and 2e-6, as detailed in Section 4.3. (Section 4.1 and Table 6) |
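The "upcycled MoE blocks" referenced above follow the sparse-upcycling idea of initializing every expert from a pre-trained dense MLP and adding a learned router. The snippet below is a minimal PyTorch sketch of that pattern, not the authors' code: the expert count (4) and top-2 routing are illustrative assumptions, and the dense MLP shape is hypothetical.

```python
# Minimal sketch of upcycling a dense MLP into a sparse MoE block.
# Not CuMo's implementation; expert count and top-k are assumed values.
import copy
import torch
import torch.nn as nn


class UpcycledMoE(nn.Module):
    def __init__(self, dense_mlp: nn.Module, hidden_dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Each expert starts as a copy of the pre-trained dense MLP weights.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts)
        )
        # Router scores each token; only the top-k experts are evaluated.
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                      # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


# Usage: swap a pre-trained MLP block for its upcycled MoE counterpart.
dense = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
moe_block = UpcycledMoE(dense, hidden_dim=1024)
tokens = torch.randn(8, 1024)
print(moe_block(tokens).shape)  # torch.Size([8, 1024])
```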
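For quick reference, the hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be collected into a single config. The stage names and structure below are an assumed layout, not the paper's actual config files; only the numeric values are taken from the report.

```python
# Hypothetical summary of the quoted training setup (Table 6 / Section 4.1);
# fields not quoted in the report are deliberately omitted.
FINAL_RUN = {
    "pre_training":              {"gpus": "8x A100"},
    "pre_finetuning":            {"gpus": "16x A100"},
    "visual_instruction_tuning": {
        "gpus": "32x A100",
        "total_batch_size": 256,
        "learning_rate": 4e-6,
    },
}

ABLATIONS = {
    "total_batch_size": 128,
    "learning_rates": [2e-5, 2e-6],  # per Section 4.3
}
```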