Efficient Large Multi-modal Models via Visual Context Compression

Authors: Jieneng Chen, Luoxin Ye, Ju He, Zhaoyang Wang, Daniel Khashabi, Alan L. Yuille

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs and improving inference efficiency.
Researcher Affiliation | Academia | Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille (Johns Hopkins University)
Pseudocode | No | The paper does not include any figure, block, or section labeled 'Pseudocode', 'Algorithm', or 'Algorithm X', nor does it present structured steps for a method or procedure formatted like code.
Open Source Code | Yes | Website: https://beckschen.github.io/llavolta.html Code: https://github.com/Beckschen/LLaVolta
Open Datasets | Yes | We adopt thirteen benchmarks specifically designed for MLLM evaluation, including GQA [20], MM-Vet [50], ScienceQA (SQA) [31], MME [13], TextVQA [39], POPE [24], MMBench [30], MMBench-CN [30], VQA-v2 [14], LLaVA-Bench-in-the-Wild (LLaVA-W) [28], VizWiz [15], SEED-Image [22] and MMMU [52].
Dataset Splits | Yes | We follow LLaVA-1.5 [27] to perform data preparation and training schedule for pretraining and instruction tuning. We adopt thirteen benchmarks specifically designed for MLLM evaluation, including GQA [20], MM-Vet [50], ScienceQA (SQA) [31], MME [13], TextVQA [39], POPE [24], MMBench [30], MMBench-CN [30], VQA-v2 [14], LLaVA-Bench-in-the-Wild (LLaVA-W) [28], VizWiz [15], SEED-Image [22] and MMMU [52].
Hardware Specification | Yes | We conduct all the experiments with a machine of 8 Nvidia RTX 6000 Ada GPUs.
Software Dependencies | No | The paper mentions 'Vicuna-v1.5-7B [10]', 'LLaMA2 codebase [43]' and 'DeepSpeed ZeRO-3', but does not provide specific version numbers for these or other general software dependencies (e.g., Python, PyTorch libraries) that would ensure reproducibility.
Experiment Setup | Yes | We adopt the Vicuna-v1.5-7B [10] as the language model, leveraging the LLaMA2 codebase [43]. We leverage the pre-trained CLIP ViT-L/14 [12, 36] with an input resolution of 336 × 336, resulting in 576 visual tokens. We employ the LLaVA framework [27] to connect the frozen CLIP vision encoder and the Vicuna LLMs. Along with the projector, we train the entire LLM instead of parameter-efficient finetuning. We follow LLaVA-1.5 [27] to perform data preparation and training schedule for pretraining and instruction tuning.
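The experiment setup pins down the visual token budget: a CLIP ViT-L/14 encoder at 336 × 336 resolution yields (336 / 14)^2 = 576 patch tokens, which a projector maps into the LLM embedding space before the paper's visual context compression shortens the sequence. The PyTorch sketch below illustrates that pipeline under stated assumptions: the two-layer MLP projector mirrors the LLaVA-1.5 design, while the average-pooling compressor, its stride, and the feature dimensions are illustrative stand-ins, not the paper's exact operator.

```python
import torch
import torch.nn as nn

# Visual token count implied by the setup: CLIP ViT-L/14 at 336x336 input
# gives (336 / 14)^2 = 24 * 24 = 576 patch tokens.
NUM_VISUAL_TOKENS = (336 // 14) ** 2  # 576


class VisualTokenCompressor(nn.Module):
    """Illustrative LLaVA-style projector followed by token compression.

    The pooling-based compression is a hypothetical stand-in for the paper's
    visual context compression; stride and feature dimensions are assumptions.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # Two-layer MLP projector mapping frozen CLIP features into the LLM
        # embedding space (dimensions are typical values, not quoted figures).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Average pooling along the token axis reduces 576 tokens to
        # 576 / stride tokens (illustrative choice of compression operator).
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, 576, vision_dim) patch embeddings from the
        # frozen CLIP ViT-L/14 vision encoder.
        tokens = self.projector(clip_features)      # (B, 576, llm_dim)
        tokens = self.pool(tokens.transpose(1, 2))  # pool over the token axis
        return tokens.transpose(1, 2)               # (B, 576 // stride, llm_dim)


if __name__ == "__main__":
    dummy = torch.randn(2, NUM_VISUAL_TOKENS, 1024)
    out = VisualTokenCompressor()(dummy)
    print(out.shape)  # torch.Size([2, 144, 4096])
```

Compressing 576 tokens to 144 with stride 4 is only an example; the point it illustrates is that visual context compression shortens the sequence the LLM must attend over, which is where the reported training-cost and inference-efficiency gains come from.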