Visual Perception by Large Language Model’s Weights
Authors: Feipeng Ma, Hongwei Xue, Yizhou Zhou, Guangting Wang, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. |
| Researcher Affiliation | Collaboration | (1) University of Science and Technology of China; (2) WeChat, Tencent Inc.; (3) Show Lab, National University of Singapore; (4) Fudan University; (5) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | No | The paper does not contain an explicit pseudocode or algorithm block. |
| Open Source Code | Yes | Code and models are released at https://github.com/FeipengMa6/VLoRA. |
| Open Datasets | Yes | During pre-training, we use image-text pairs to train our model. Specifically, we use a subset of CapsFusion-120M [59] with 30 million image-text pairs. CapsFusion-120M randomly collects image-text pairs from LAION-COCO [1]... For supervised fine-tuning, we adopt the same data as LLaVA-v1.5. Specifically, the supervised fine-tuning data is constructed with VQAv2 [13], GQA [19], OKVQA [45], OCRVQA [46], A-OKVQA [52], TextCaps [53], RefCOCO [44, 21], Visual Genome [23], ShareGPT [2], and LLaVA-Instruct [36], with a total of 665K conversation data. |
| Dataset Splits | Yes | ScienceQA [42] is constructed from elementary and high school science curricula. Questions of ScienceQA span three subjects: natural science, language science, and social science. We use samples with images from the validation set to evaluate MLLMs. |
| Hardware Specification | Yes | The pre-training stage uses 24 NVIDIA H800 GPUs for 7 hours... We train for one epoch on 8 NVIDIA H800 GPUs, which takes 2 hours. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We use the AdamW [40] optimizer with a learning rate of 5e-5, which follows a linear warm-up and then a cosine decay schedule. The pre-training is conducted with a total batch size of 768 for 40,000 iterations. The input images are resized to a resolution of 336×336... The learning rate is set to 5e-5 and the learning rate schedule is the same as in the pre-training stage. The global batch size is 128. (A configuration sketch of this optimizer and schedule follows the table.) |
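
The quoted experiment setup translates directly into a standard optimizer/scheduler configuration. Below is a minimal sketch, assuming PyTorch's built-in `AdamW`, `LinearLR`, `CosineAnnealingLR`, and `SequentialLR`; the warm-up length and the placeholder model are assumptions, since the excerpt only states that the schedule uses a linear warm-up followed by cosine decay. This is not the authors' released training code.

```python
# Minimal sketch of the pre-training optimization setup quoted above.
# Assumptions (not stated in the excerpt): a toy placeholder model and a
# warm-up length of 2,000 iterations.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(4096, 4096)   # placeholder standing in for the MLLM
total_iters = 40_000                  # pre-training iterations (from the paper)
warmup_iters = 2_000                  # assumed; the paper only says "linear warm-up"

optimizer = AdamW(model.parameters(), lr=5e-5)

# Linear warm-up followed by cosine decay, matching the quoted schedule.
warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_iters)
cosine = CosineAnnealingLR(optimizer, T_max=total_iters - warmup_iters)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_iters])

for step in range(total_iters):
    # ... forward/backward on a global batch of 768 image-text pairs
    #     with images resized to 336x336 ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```

The fine-tuning stage described in the excerpt reuses the same schedule with a learning rate of 5e-5 and a global batch size of 128, so only `total_iters` and the batch size would change in a sketch like this.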