Dense Connector for MLLMs
Authors: Huanjin Yao, Wenhao Wu, Taojiannan Yang, Yuxin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B→70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. |
| Researcher Affiliation | Collaboration | Huanjin Yao (1,3)*, Wenhao Wu (2)*, Taojiannan Yang (4), Yuxin Song (3), Mengxi Zhang (3), Haocheng Feng (3), Yifan Sun (3), Zhiheng Li (1), Wanli Ouyang (5), Jingdong Wang (3). Affiliations: 1 Shenzhen International Graduate School, Tsinghua University; 2 The University of Sydney; 3 Baidu Inc.; 4 Amazon; 5 The Chinese University of Hong Kong |
| Pseudocode | No | The paper provides architectural diagrams and mathematical equations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/HJYao00/DenseConnector. |
| Open Datasets | Yes | Training Datasets. Data quality plays a crucial role in determining the performance of MLLMs. In this study, we examine the impact of two high-quality training datasets on our model: LLaVA-1.5 [16] and Mini-Gemini [18]. |
| Dataset Splits | No | The paper describes the datasets used for pre-training and instruction tuning, providing counts for these. However, it does not specify a distinct 'validation' dataset split used during training for hyperparameter tuning or early stopping, which is common in machine learning experiments. Evaluation is done on separate benchmark datasets. |
| Hardware Specification | Yes | We train all models on 8 NVIDIA A100 GPUs with 40GB VRAM, except for the Hermes-2-Yi-34B and Llama-3-70B-Instruct, which utilize 32 NVIDIA A100 GPUs with 80GB VRAM. |
| Software Dependencies | No | The paper mentions using specific models and components like CLIP, ViT, and LLMs (e.g., Vicuna, Llama3), but it does not specify software dependencies like Python, PyTorch, or TensorFlow with their exact version numbers. |
| Experiment Setup | Yes | Our training process comprises two stages: pre-training and instruction fine-tuning. In the pre-training phase, we initialize the visual encoder and LLM with pre-trained weights, while the Dense Connector is randomly initialized. Here, we freeze the visual encoder and the LLM, updating only the parameters of the Dense Connector. The model undergoes pre-training for one epoch with a global batch size of 256 and a learning rate of 1e-3. Subsequently, in the instruction fine-tuning stage, we maintain the visual encoder frozen while updating the Dense Connector and the LLM. Fine-tuning is performed for 1 epoch with a global batch size of 128 and a learning rate of 2e-5. For models using LoRA fine-tuning, we set the LoRA rank to 128 and LoRA alpha to 256. |
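
The Experiment Setup row above states the hyperparameters concretely enough to sketch them as a configuration. The snippet below is a minimal, hypothetical illustration: the `StageConfig` dataclass and its field names are our own and do not come from the authors' repository, and the `LoraConfig` call assumes the Hugging Face `peft` library, which the paper does not name. Only the numeric values (epochs, batch sizes, learning rates, LoRA rank/alpha) are taken from the paper.

```python
# Sketch of the two-stage training setup described in the Experiment Setup row.
# StageConfig and its fields are illustrative; LoraConfig assumes the `peft` library.
from dataclasses import dataclass

from peft import LoraConfig


@dataclass
class StageConfig:
    name: str
    epochs: int
    global_batch_size: int
    learning_rate: float
    trainable: tuple[str, ...]  # modules updated during this stage


# Stage 1: pre-training -- visual encoder and LLM frozen, only the Dense Connector updates.
pretrain = StageConfig(
    name="pretrain",
    epochs=1,
    global_batch_size=256,
    learning_rate=1e-3,
    trainable=("dense_connector",),
)

# Stage 2: instruction fine-tuning -- visual encoder stays frozen,
# Dense Connector and LLM are both updated.
finetune = StageConfig(
    name="instruction_finetune",
    epochs=1,
    global_batch_size=128,
    learning_rate=2e-5,
    trainable=("dense_connector", "llm"),
)

# For the LoRA-tuned variants, the paper reports rank 128 and alpha 256.
lora_config = LoraConfig(r=128, lora_alpha=256)

if __name__ == "__main__":
    for stage in (pretrain, finetune):
        print(stage)
    print(lora_config)
```

This layout simply groups the per-stage hyperparameters so the freeze/update schedule is explicit; the actual training loop, optimizer, and data pipeline from the released code are not reproduced here.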