MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Authors: Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks. |
| Researcher Affiliation | Collaboration | Zhuofan Zong1,2, Bingqi Ma2, Dazhong Shen3, Guanglu Song2, Hao Shao1, Dongzhi Jiang1, Hongsheng Li1,3,4, Yu Liu2; 1CUHK MMLab, 2SenseTime Research, 3Shanghai AI Laboratory, 4CPII under InnoHK |
| Pseudocode | No | The paper describes the methodology in detail but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are available at https://github.com/TempleX98/MoVA. |
| Open Datasets | Yes | In the pretraining stage, we first construct 15M visual instruction samples across diverse domains as the training data: (i) Image caption data that covers 4M randomly selected samples from DataComp-1B [56], ShareGPT4V-PT [52], and ALLaVA-4V [57]. (ii) Visual grounding and localization dataset that encompasses Objects365 [58], RefCOCO [21], Visual Genome [59], PointQA [60], and Flickr30K [61]. (iii) Chart understanding data that includes MMC-Instruction [62], Chart2Text [63], DVQA [64], and SciGraphQA [65]. (iv) Text recognition and document parsing data that covers LLaVAR-PT [66] and 3M English document images from Common Crawl. (v) LLaVA-Med [67] for biomedical image understanding. |
| Dataset Splits | Yes | We utilize high-quality visual instruction tuning data that build upon LLaVA-665K [28] for finetuning. Additionally, we integrate several visual question answering datasets across various domains, such as DocVQA [17], ChartQA [18], InfographicVQA [68], AI2D [69], ST-VQA [70], TextVQA [71], SynthDoG-en [72], Geometry3K [73], PGPS9K [74], Geo170K [75], RefCOCO, LLaVA-Med, VQA-RAD [76], and SLAKE [22]. We also encompass equivalent comprehensive captions [52, 57, 77, 78] generated by the advanced GPT-4V [54] for improved world knowledge. Apart from the above instruction tuning data, we convert the selected 2K routing annotations to instructions and incorporate them into the training data. |
| Hardware Specification | Yes | The MoVA models with Vicuna-7B and Llama3-8B are pretrained using 64 A100 80G GPUs for 2 days, and finetuned using 32 A100 80G GPUs for 1 day. The MoVA with 34B LLM is pretrained using 128 A100 80G GPUs for 5 days and finetuned using 64 A100 80G GPUs for 2 days. |
| Software Dependencies | No | The paper mentions using 'bfloat16 and flash-attention 2' but does not specify version numbers for these or for other key software components such as Python, PyTorch, or CUDA. (A hedged loading sketch follows the table.) |
| Experiment Setup | Yes | In the pretraining stage, we use the AdamW optimizer with an initial learning rate of 2 × 10⁻⁴, a batch size of 1024, and train the model for 1 epoch. We jointly finetune the weights of all components except additional vision experts with a batch size of 128 and an initial learning rate of 2 × 10⁻⁵ during supervised finetuning. (A hedged configuration sketch follows the table.) |
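The "Software Dependencies" row notes that the paper mentions bfloat16 and flash-attention 2 without pinning versions. The snippet below is a minimal sketch of how these are commonly enabled in a Hugging Face transformers setup; the use of transformers itself and the model id are assumptions for illustration, not details confirmed by the paper.

```python
# Minimal sketch (assumed setup, not the authors' code): load a base LLM in
# bfloat16 with FlashAttention-2 enabled via Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",                   # placeholder base LLM (assumption)
    torch_dtype=torch.bfloat16,               # bfloat16, as mentioned in the paper
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a CUDA GPU
)
```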
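The "Experiment Setup" row reports optimizer, learning rates, and batch sizes at a high level. Below is a minimal sketch of those reported settings using PyTorch's AdamW; the toy model, the `build_optimizer` helper, and all unstated hyperparameters (betas, weight decay, scheduler) are placeholders or defaults assumed for illustration, not the authors' implementation.

```python
# Minimal sketch of the reported optimizer settings: AdamW with lr 2e-4 for
# pretraining and 2e-5 for finetuning. Batch sizes (1024 / 128) would be set
# in the dataloader, which is omitted here.
from torch import nn
from torch.optim import AdamW

def build_optimizer(model: nn.Module, stage: str) -> AdamW:
    # Learning rates as reported in the paper; other AdamW hyperparameters
    # are left at PyTorch defaults (an assumption).
    lr = 2e-4 if stage == "pretrain" else 2e-5
    return AdamW(model.parameters(), lr=lr)

if __name__ == "__main__":
    toy_model = nn.Linear(16, 16)            # stand-in for the actual MoVA model
    optimizer = build_optimizer(toy_model, "pretrain")
    print(optimizer.defaults["lr"])          # 0.0002
```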