Dense Connector for MLLMs

Authors: Huanjin Yao, Wenhao Wu, Taojiannan Yang, Yuxin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B to 70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT, and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks.
Researcher Affiliation | Collaboration | Huanjin Yao (1,3)*, Wenhao Wu (2)*, Taojiannan Yang (4), Yuxin Song (3), Mengxi Zhang (3), Haocheng Feng (3), Yifan Sun (3), Zhiheng Li (1), Wanli Ouyang (5), Jingdong Wang (3). 1: Shenzhen International Graduate School, Tsinghua University; 2: The University of Sydney; 3: Baidu Inc.; 4: Amazon; 5: The Chinese University of Hong Kong.
Pseudocode | No | The paper provides architectural diagrams and mathematical equations but does not include any pseudocode or algorithm blocks. (An illustrative pseudocode-style sketch is given after this table.)
Open Source Code | Yes | Code is available at https://github.com/HJYao00/DenseConnector.
Open Datasets | Yes | Training Datasets. Data quality plays a crucial role in determining the performance of MLLMs. In this study, we examine the impact of two high-quality training datasets on our model: LLaVA-1.5 [16] and Mini-Gemini [18].
Dataset Splits | No | The paper describes the datasets used for pre-training and instruction tuning and reports their sizes, but it does not specify a distinct validation split used during training for hyperparameter tuning or early stopping, as is common in machine learning experiments. Evaluation is instead performed on separate benchmark datasets.
Hardware Specification | Yes | We train all models on 8 NVIDIA A100 GPUs with 40GB VRAM, except for Hermes-2-Yi-34B and Llama-3-70B-Instruct, which utilize 32 NVIDIA A100 GPUs with 80GB VRAM.
Software Dependencies | No | The paper mentions specific models and components such as CLIP, ViT, and LLMs (e.g., Vicuna, Llama 3), but it does not specify software dependencies like Python, PyTorch, or TensorFlow with their exact version numbers.
Experiment Setup | Yes | Our training process comprises two stages: pre-training and instruction fine-tuning. In the pre-training phase, we initialize the visual encoder and LLM with pre-trained weights, while the Dense Connector is randomly initialized. Here, we freeze the visual encoder and the LLM, updating only the parameters of the Dense Connector. The model undergoes pre-training for one epoch with a global batch size of 256 and a learning rate of 1e-3. Subsequently, in the instruction fine-tuning stage, we keep the visual encoder frozen while updating the Dense Connector and the LLM. Fine-tuning is performed for 1 epoch with a global batch size of 128 and a learning rate of 2e-5. For models using LoRA fine-tuning, we set the LoRA rank to 128 and the LoRA alpha to 256. (This recipe is summarized in the configuration sketch after the table.)
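
Since the paper reports no pseudocode, the following PyTorch-style sketch illustrates what a dense-connector-style module could look like: features from several layers of a frozen visual encoder are fused channel-wise and projected into the LLM embedding space by an MLP. The class name, dimensions, and the choice of three feature groups are assumptions made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a dense-connector-style module (not the authors' code).
import torch
import torch.nn as nn

class DenseConnectorSketch(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096, num_feature_groups=3):
        super().__init__()
        # Channel-wise fusion widens the projector input by the number of
        # fused feature groups (an assumption for this sketch).
        self.proj = nn.Sequential(
            nn.Linear(vit_dim * num_feature_groups, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states):
        # hidden_states: list of per-layer ViT outputs, each (B, N_tokens, vit_dim),
        # e.g. the final layer plus two intermediate layers of the frozen encoder.
        fused = torch.cat(hidden_states, dim=-1)   # (B, N_tokens, vit_dim * groups)
        return self.proj(fused)                    # (B, N_tokens, llm_dim)

# Usage sketch: fuse the final layer with two earlier layers.
feats = [torch.randn(2, 576, 1024) for _ in range(3)]
tokens = DenseConnectorSketch()(feats)             # -> (2, 576, 4096)
```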
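
For readability, the two-stage recipe quoted in the Experiment Setup row can be condensed into a configuration sketch. The dictionary keys and the helper function are this sketch's own naming, not the repository's config schema; the numbers are taken directly from the paper.

```python
# Illustrative summary of the reported two-stage training recipe
# (key names are assumptions of this sketch, values are from the paper).
TRAINING_STAGES = {
    "pretraining": {
        "trainable_modules": ["dense_connector"],         # visual encoder and LLM frozen
        "epochs": 1,
        "global_batch_size": 256,
        "learning_rate": 1e-3,
    },
    "instruction_finetuning": {
        "trainable_modules": ["dense_connector", "llm"],  # visual encoder stays frozen
        "epochs": 1,
        "global_batch_size": 128,
        "learning_rate": 2e-5,
        "lora": {"rank": 128, "alpha": 256},              # only for LoRA-tuned variants
    },
}

def freeze_for_stage(model, stage):
    """Toggle requires_grad so only the stage's trainable modules are updated."""
    cfg = TRAINING_STAGES[stage]
    for name, param in model.named_parameters():
        param.requires_grad = any(m in name for m in cfg["trainable_modules"])
```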