Efficient Large Multi-modal Models via Visual Context Compression
Authors: Jieneng Chen, Luoxin Ye, Ju He, Zhaoyang Wang, Daniel Khashabi, Alan L. Yuille
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs and improving inference efficiency. |
| Researcher Affiliation | Academia | Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille (Johns Hopkins University) |
| Pseudocode | No | The paper does not include any figure, block, or section labeled 'Pseudocode', 'Algorithm', or 'Algorithm X', nor does it present structured steps for a method or procedure formatted like code. |
| Open Source Code | Yes | Website: https://beckschen.github.io/llavolta.html, Code: https://github.com/Beckschen/LLaVolta |
| Open Datasets | Yes | We adopt thirteen benchmarks specifically designed for MLLM evaluation, including GQA [20], MM-Vet [50], ScienceQA (SQA) [31], MME [13], TextVQA [39], POPE [24], MMBench [30], MMBench-CN [30], VQA-v2 [14], LLaVA-Bench-in-the-Wild (LLaVA-W) [28], VizWiz [15], SEED-Image [22] and MMMU [52]. |
| Dataset Splits | Yes | We follow LLaVA-1.5 [27] to perform data preparation and training schedule for pretraining and instruction tuning. We adopt thirteen benchmarks specifically designed for MLLM evaluation, including GQA [20], MM-Vet [50], ScienceQA (SQA) [31], MME [13], TextVQA [39], POPE [24], MMBench [30], MMBench-CN [30], VQA-v2 [14], LLaVA-Bench-in-the-Wild (LLaVA-W) [28], VizWiz [15], SEED-Image [22] and MMMU [52]. |
| Hardware Specification | Yes | We conduct all the experiments with the machine of 8 Nvidia RTX 6000 Ada. |
| Software Dependencies | No | The paper mentions 'Vicuna-v1.5-7B [10]', the 'LLaMA2 codebase [43]', and 'DeepSpeed ZeRO-3', but does not provide specific version numbers for these or for other general software dependencies (e.g., Python, PyTorch libraries) that would ensure reproducibility. |
| Experiment Setup | Yes | We adopt the Vicuna-v1.5-7B [10] as the language model, leveraging the LLaMA2 codebase [43]. We leverage the pre-trained CLIP ViT-L/14 [12, 36] with an input resolution of 336 × 336, resulting in 576 visual tokens. We employ the LLaVA framework [27] to connect the frozen CLIP vision encoder and the Vicuna LLMs. Along with the projector, we train the entire LLM instead of parameter-efficient finetuning. We follow LLaVA-1.5 [27] to perform data preparation and training schedule for pretraining and instruction tuning. (A sketch of the visual-token arithmetic and projector wiring follows the table.) |
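
The quoted setup implies 576 visual tokens per image: a 336 × 336 input split into 14 × 14 patches gives (336 / 14)² = 24 × 24 = 576 patch tokens, each of which is projected into the LLM's embedding space. Below is a minimal sketch of that arithmetic with a frozen CLIP encoder feeding a simple MLP projector; it is not the authors' LLaVolta code. The Hugging Face checkpoint name, the 2-layer MLP projector shape, and the hidden sizes (1024 for CLIP ViT-L/14, 4096 for Vicuna-7B) are assumptions made for illustration.

```python
# Sketch only: not the authors' LLaVolta implementation.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# 336x336 input with 14x14 patches -> (336 / 14)^2 = 24 * 24 = 576 patch tokens
image_size, patch_size = 336, 14
num_visual_tokens = (image_size // patch_size) ** 2
assert num_visual_tokens == 576

# Frozen CLIP ViT-L/14 vision encoder (checkpoint name assumed for illustration)
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
vision_tower.requires_grad_(False)

# Hypothetical 2-layer MLP projector from CLIP hidden size (1024) to LLM hidden size (4096)
projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

dummy_image = torch.randn(1, 3, image_size, image_size)  # stand-in for a preprocessed image
with torch.no_grad():
    feats = vision_tower(pixel_values=dummy_image).last_hidden_state  # (1, 577, 1024), incl. CLS
patch_tokens = feats[:, 1:, :]        # drop CLS token -> (1, 576, 1024)
llm_inputs = projector(patch_tokens)  # (1, 576, 4096) visual embeddings for the LLM
print(llm_inputs.shape)
```

In the quoted setup the vision encoder stays frozen while the projector and the entire LLM are trained; the projected visual tokens are then consumed by the Vicuna LLM alongside the text tokens.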