Are We on the Right Way for Evaluating Large Vision-Language Models?

Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain. (See the metric sketch after this table.)
Researcher Affiliation | Collaboration | Lin Chen (1,3), Jinsong Li (2,3), Xiaoyi Dong (2,3), Pan Zhang (3), Yuhang Zang (3), Zehui Chen (1), Haodong Duan (3), Jiaqi Wang (3), Yu Qiao (3), Dahua Lin (2,3,4), Feng Zhao (1); 1 University of Science and Technology of China, 2 The Chinese University of Hong Kong, 3 Shanghai AI Laboratory, 4 CPII under InnoHK
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | All experiments in this study are conducted within the same codebase modified from VLMEval Kit [11], and utilize NVIDIA A100 GPUs for non-API-based evaluation. VLMEval Kit [11] refers to Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass.
Open Datasets | Yes | We first choose two benchmarks [34, 27] focused on natural images and four centered on scientific and technical knowledge [64, 38, 26, 37] for our sample collection.
Dataset Splits | No | The paper mentions using existing benchmarks such as MMMU-Val, which have predefined validation sets. However, it does not provide details on its own train/validation splits for its experiments or for the newly curated MMStar benchmark.
Hardware Specification | Yes | All experiments in this study are conducted within the same codebase modified from VLMEval Kit [11], and utilize NVIDIA A100 GPUs for non-API-based evaluation.
Software Dependencies | No | The paper mentions using 'VLMEval Kit [15]' and 'VLMEval Kit [11]' but does not specify version numbers for these toolkits or for any other software dependencies such as Python or PyTorch.
Experiment Setup | Yes | For evaluating LLMs on existing benchmarks, we employ both 0-shot and 2-shot strategies and will specify which is utilized when reporting results. For evaluating LLMs on MMStar, the 0-shot strategy yields poor scores, making comparisons difficult. Therefore, we exclusively utilize the 2-shot strategy to decrease the frequency of refusal to answer. Moreover, all LVLMs are evaluated utilizing the 0-shot strategy across all benchmarks to ensure a fair comparison. (See the prompt-construction sketch after this table.)
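The Research Type row refers to the paper's two diagnostic metrics, multi-modal gain and multi-modal leakage, which compare an LVLM's accuracy with and without visual input against the accuracy of its underlying base LLM. The Python sketch below is a minimal illustration of that style of computation; the function name, argument names, and the exact leakage clipping are assumptions made for this review, not the paper's reference implementation.

```python
def multimodal_gain_and_leakage(acc_lvlm_with_image: float,
                                acc_lvlm_text_only: float,
                                acc_base_llm: float):
    """Illustrative computation of the two diagnostic metrics (assumed form).

    acc_lvlm_with_image : LVLM accuracy when the image is provided.
    acc_lvlm_text_only  : LVLM accuracy when the image is withheld.
    acc_base_llm        : accuracy of the LVLM's base LLM (text only).

    Gain measures how much the visual input actually helps; leakage measures
    how much the LVLM answers correctly without the image beyond what its base
    LLM can do, a possible sign of benchmark data leaking into training.
    """
    gain = acc_lvlm_with_image - acc_lvlm_text_only
    leakage = max(0.0, acc_lvlm_text_only - acc_base_llm)
    return gain, leakage


# Hypothetical accuracies (in percent) for one model on one benchmark.
gain, leakage = multimodal_gain_and_leakage(
    acc_lvlm_with_image=65.2,
    acc_lvlm_text_only=48.7,
    acc_base_llm=41.3,
)
print(f"multi-modal gain: {gain:.1f}, multi-modal leakage: {leakage:.1f}")
```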
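The Experiment Setup row mentions a 2-shot strategy used to keep LLMs from refusing to answer when benchmark questions are posed without images. The sketch below shows one plausible way to assemble such a prompt for a multiple-choice question; the demonstration questions and the answer format are illustrative assumptions, not the paper's actual prompt template.

```python
# Minimal sketch of 2-shot prompt construction for text-only LLM evaluation on
# multiple-choice questions. The demonstrations and formatting are assumed.

FEW_SHOT_EXAMPLES = [
    {
        "question": "Which option is a primary color?",
        "options": {"A": "Green", "B": "Red", "C": "Purple", "D": "Orange"},
        "answer": "B",
    },
    {
        "question": "How many sides does a triangle have?",
        "options": {"A": "2", "B": "3", "C": "4", "D": "5"},
        "answer": "B",
    },
]


def format_example(question, options, answer=None):
    """Render one multiple-choice question; append the answer for demonstrations."""
    lines = [f"Question: {question}"]
    lines += [f"{key}. {text}" for key, text in options.items()]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_two_shot_prompt(question, options):
    """Concatenate two solved demonstrations with the unsolved target question."""
    shots = [format_example(**ex) for ex in FEW_SHOT_EXAMPLES]
    target = format_example(question, options)
    return "\n\n".join(shots + [target])


if __name__ == "__main__":
    prompt = build_two_shot_prompt(
        "What is shown in the image?",  # text-only setting: the image is withheld
        {"A": "A cat", "B": "A dog", "C": "A car", "D": "A tree"},
    )
    print(prompt)
```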