MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Authors: Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks.
Researcher Affiliation | Collaboration | Zhuofan Zong (1,2), Bingqi Ma (2), Dazhong Shen (3), Guanglu Song (2), Hao Shao (1), Dongzhi Jiang (1), Hongsheng Li (1,3,4), Yu Liu (2); 1 CUHK MMLab, 2 SenseTime Research, 3 Shanghai AI Laboratory, 4 CPII under InnoHK
Pseudocode | No | The paper describes the methodology in detail but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Codes and models are available at https://github.com/TempleX98/MoVA.
Open Datasets | Yes | In the pretraining stage, we first construct 15M visual instruction samples across diverse domains as the training data: (i) Image caption data that covers 4M randomly selected samples from DataComp-1B [56], ShareGPT4V-PT [52], and ALLaVA-4V [57]. (ii) Visual grounding and localization data that encompasses Objects365 [58], RefCOCO [21], Visual Genome [59], PointQA [60], and Flickr30K [61]. (iii) Chart understanding data that includes MMC-Instruction [62], Chart2Text [63], DVQA [64], and SciGraphQA [65]. (iv) Text recognition and document parsing data that covers LLaVAR-PT [66] and 3M English document images from Common Crawl. (v) LLaVA-Med [67] for biomedical image understanding.
Dataset Splits | Yes | We utilize high-quality visual instruction tuning data that build upon LLaVA-665K [28] for finetuning. Additionally, we integrate several visual question answering datasets across various domains, such as DocVQA [17], ChartQA [18], InfographicVQA [68], AI2D [69], ST-VQA [70], TextVQA [71], SynthDoG-en [72], Geometry3K [73], PGPS9K [74], Geo170K [75], RefCOCO, LLaVA-Med, VQA-RAD [76], and SLAKE [22]. We also encompass equivalent comprehensive captions [52, 57, 77, 78] generated by the advanced GPT-4V [54] for improved world knowledge. Apart from the above instruction tuning data, we convert the selected 2K routing annotations to instructions and incorporate them into the training data.
Hardware Specification | Yes | The MoVA models with Vicuna-7B and Llama3-8B are pretrained using 64 A100 80G GPUs for 2 days, and finetuned using 32 A100 80G GPUs for 1 day. The MoVA with the 34B LLM is pretrained using 128 A100 80G GPUs for 5 days and finetuned using 64 A100 80G GPUs for 2 days.
Software Dependencies | No | The paper mentions using 'bfloat16 and flash-attention 2' but does not specify version numbers for these or other key software components like Python, PyTorch, or CUDA.
Experiment Setup | Yes | In the pretraining stage, we use the AdamW optimizer with an initial learning rate of 2 × 10^-4, a batch size of 1024, and train the model for 1 epoch. We jointly finetune the weights of all components except additional vision experts with a batch size of 128 and an initial learning rate of 2 × 10^-5 during supervised finetuning.
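
For orientation, the training recipe quoted in the Experiment Setup and Software Dependencies rows can be condensed into a short configuration sketch. This is only a sketch, not the authors' training code: the stage names, dictionary layout, and the build_optimizer helper are assumed for illustration, while the numeric hyperparameters, the frozen additional vision experts, and the bfloat16/flash-attention 2 mention come from the paper.

```python
import torch

# Sketch of the reported training recipe. Numeric values are taken from the
# paper; the dictionary layout, stage names, and helper below are assumptions.
TRAINING_STAGES = {
    "pretrain": {
        "optimizer": "AdamW",
        "init_lr": 2e-4,
        "global_batch_size": 1024,
        "epochs": 1,
    },
    "supervised_finetune": {
        "optimizer": "AdamW",  # assumed to match the pretraining optimizer
        "init_lr": 2e-5,
        "global_batch_size": 128,
        # The paper states that all components except the additional vision
        # experts are jointly finetuned in this stage.
        "frozen_modules": ["additional_vision_experts"],
    },
}

# The paper mentions bfloat16 and flash-attention 2, without version numbers.
MIXED_PRECISION_DTYPE = torch.bfloat16


def build_optimizer(params, stage: str) -> torch.optim.AdamW:
    """Build the AdamW optimizer for a given stage (illustrative only)."""
    cfg = TRAINING_STAGES[stage]
    return torch.optim.AdamW(params, lr=cfg["init_lr"])
```

A call such as build_optimizer(model.parameters(), "pretrain") would then reproduce the reported pretraining learning rate; the global batch sizes reflect the multi-GPU setup described in the Hardware Specification row.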