Are We on the Right Way for Evaluating Large Vision-Language Models?
Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain (see the metric sketch after this table). |
| Researcher Affiliation | Collaboration | Lin Chen (1,3), Jinsong Li (2,3), Xiaoyi Dong (2,3), Pan Zhang (3), Yuhang Zang (3), Zehui Chen (1), Haodong Duan (3), Jiaqi Wang (3), Yu Qiao (3), Dahua Lin (2,3,4), Feng Zhao (1). Affiliations: 1 University of Science and Technology of China; 2 The Chinese University of Hong Kong; 3 Shanghai AI Laboratory; 4 CPII under InnoHK. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | All experiments in this study are conducted within the same codebase modified from VLMEval Kit [11], and utilize NVIDIA A100 GPUs for non-API-based evaluation. VLMEval Kit [11] refers to Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass. |
| Open Datasets | Yes | We first choose two benchmarks [34, 27] focused on natural images and four centered on scientific and technical knowledge [64, 38, 26, 37] for our sample collection. |
| Dataset Splits | No | The paper mentions using existing benchmarks like MMMU-Val, which have predefined validation sets. However, it does not provide details on its own train/validation splits for its experiments or for the newly curated MMStar benchmark. |
| Hardware Specification | Yes | All experiments in this study are conducted within the same codebase modified from VLMEval Kit [11], and utilize NVIDIA A100 GPUs for non-API-based evaluation. |
| Software Dependencies | No | The paper mentions using 'VLMEval Kit' (cited inconsistently as [11] and [15]) but does not specify a version number for the toolkit or for any other software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | For evaluating LLMs on existing benchmarks, we employ both 0-shot and 2-shot strategies and will specify which is utilized when reporting results. For evaluating LLMs on MMStar, the 0-shot strategy yields poor scores, making comparisons difficult. Therefore, we exclusively utilize the 2-shot strategy to decrease the frequency of refusal to answer (a 2-shot prompt sketch follows the table). Moreover, all LVLMs are evaluated utilizing the 0-shot strategy across all benchmarks to ensure a fair comparison. |
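
The multi-modal gain and multi-modal leakage metrics referenced in the Research Type row compare an LVLM's benchmark score with and without visual input against its base LLM's text-only score. Below is a minimal sketch of that comparison, assuming the metrics reduce to simple score differences (gain = with-images minus without-images; leakage = the portion of the image-free LVLM score exceeding the base LLM's score, clamped at zero); the function names and example numbers are illustrative and not taken from the paper's codebase.

```python
def multi_modal_gain(score_lvlm_with_images: float,
                     score_lvlm_text_only: float) -> float:
    """Gain attributable to actually using the visual input:
    LVLM accuracy with images minus LVLM accuracy without images."""
    return score_lvlm_with_images - score_lvlm_text_only


def multi_modal_leakage(score_lvlm_text_only: float,
                        score_llm_base_text_only: float) -> float:
    """Suspected leakage of benchmark data into multi-modal training:
    how much the LVLM answers correctly without images beyond what its
    LLM base already knew. Clamped at zero so negative differences
    do not count as leakage."""
    return max(0.0, score_lvlm_text_only - score_llm_base_text_only)


# Illustrative numbers only (not results from the paper):
mg = multi_modal_gain(score_lvlm_with_images=62.0, score_lvlm_text_only=45.0)        # 17.0
ml = multi_modal_leakage(score_lvlm_text_only=45.0, score_llm_base_text_only=38.0)   # 7.0
print(f"multi-modal gain: {mg:.1f}, multi-modal leakage: {ml:.1f}")
```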
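
The Experiment Setup row distinguishes 0-shot from 2-shot evaluation of the base LLMs, with the 2-shot variant used to reduce refusals to answer. The sketch below shows one way such a text-only, multiple-choice prompt could be assembled; the prompt template and exemplar questions are hypothetical and are not taken from the paper or from VLMEval Kit.

```python
from __future__ import annotations


def build_prompt(question: str, options: dict[str, str],
                 exemplars: list[tuple[str, dict[str, str], str]] | None = None) -> str:
    """Build a text-only multiple-choice prompt. With exemplars (few-shot),
    the model first sees worked examples ending in a letter answer, which
    encourages it to answer rather than refuse."""
    def fmt(q: str, opts: dict[str, str]) -> str:
        lines = [f"Question: {q}"] + [f"{k}. {v}" for k, v in opts.items()]
        return "\n".join(lines) + "\nAnswer:"

    parts = []
    for ex_q, ex_opts, ex_ans in (exemplars or []):
        parts.append(fmt(ex_q, ex_opts) + f" {ex_ans}")
    parts.append(fmt(question, options))  # target question left unanswered
    return "\n\n".join(parts)


# Hypothetical exemplars for a 2-shot, text-only evaluation of a base LLM.
exemplars = [
    ("What is 2 + 3?", {"A": "4", "B": "5", "C": "6", "D": "7"}, "B"),
    ("Which planet is known as the Red Planet?",
     {"A": "Venus", "B": "Jupiter", "C": "Mars", "D": "Mercury"}, "C"),
]
prompt = build_prompt(
    "Which gas do plants absorb during photosynthesis?",
    {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen", "D": "Helium"},
    exemplars=exemplars,
)
print(prompt)
```

Dropping the `exemplars` argument reproduces the 0-shot setting, which the assessment notes yields poor scores for base LLMs on MMStar.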