Visual Perception by Large Language Model’s Weights

Authors: Feipeng Ma, Hongwei Xue, Yizhou Zhou, Guangting Wang, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference.
Researcher Affiliation | Collaboration | 1) University of Science and Technology of China; 2) WeChat, Tencent Inc.; 3) Show Lab, National University of Singapore; 4) Fudan University; 5) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Pseudocode | No | The paper does not contain an explicit pseudocode or algorithm block.
Open Source Code | Yes | Code and models are released at https://github.com/FeipengMa6/VLoRA.
Open Datasets | Yes | During pre-training, we use image-text pairs to train our model. Specifically, we use a subset of CapsFusion-120M [59] with 30 million image-text pairs. CapsFusion-120M randomly collects image-text pairs from LAION-COCO [1]... For supervised fine-tuning, we adopt the same data as LLaVA-v1.5. Specifically, the supervised fine-tuning data is constructed with VQAv2 [13], GQA [19], OKVQA [45], OCRVQA [46], A-OKVQA [52], TextCaps [53], RefCOCO [44, 21], Visual Genome [23], ShareGPT [2], and LLaVA-Instruct [36], with a total of 665K conversation data.
Dataset Splits | Yes | ScienceQA [42] is constructed from elementary and high school science curricula. Questions of ScienceQA span three subjects: natural science, language science, and social science. We use samples with images from the validation set to evaluate MLLMs.
Hardware Specification | Yes | The pre-training stage uses 24 NVIDIA H800 GPUs for 7 hours... We train for one epoch on 8 NVIDIA H800 GPUs, which takes 2 hours.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We use the AdamW [40] optimizer with a learning rate of 5e-5, which follows a linear warm-up and then a cosine decay schedule. The pre-training is conducted with a total batch size of 768 for 40,000 iterations. The input images are resized to a resolution of 336×336... The learning rate is set to 5e-5 and the learning rate schedule is the same as in the pre-training stage. The global batch size is 128.
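
To make the quoted optimization setup concrete, the following is a minimal PyTorch sketch of the described schedule: AdamW at a peak learning rate of 5e-5 with a linear warm-up followed by cosine decay over 40,000 pre-training iterations. The model, the warm-up length, and the loop body are placeholders, since the quoted text does not specify them; this is an illustration of the stated schedule, not the authors' released training code.

```python
# Minimal sketch of the quoted schedule: AdamW, peak LR 5e-5,
# linear warm-up then cosine decay over 40,000 iterations.
# Assumptions: PyTorch; the model and the warm-up length are placeholders.
import math
import torch

model = torch.nn.Linear(8, 8)   # stand-in for the actual MLLM

total_steps = 40_000            # pre-training iterations (from the quoted setup)
warmup_steps = 2_000            # assumption: warm-up length is not quoted
peak_lr = 5e-5

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    # Linear warm-up to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # Forward pass and loss.backward() on a batch (global batch size 768) go here.
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

The same construction, with the peak learning rate kept at 5e-5 and the global batch size changed to 128, would cover the supervised fine-tuning stage described in the quote.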