MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Authors: Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks... We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and model designs. |
| Researcher Affiliation | Collaboration | National University of Singapore, Singapore; Microsoft Azure AI, USA. |
| Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/yuweihao/MM-Vet, and the online evaluator at https://huggingface.co/spaces/whyu/MM-Vet_Evaluator. |
| Open Datasets | Yes | In addition to the 187 images, ten extra images with high-quality questions are collected from VCR (Zellers et al., 2019), with the questions and answers modified to an open-ended answering format. Another three images are from Chest X-ray14 (Wang et al., 2017) to obtain corresponding medical expert knowledge. |
| Dataset Splits | No | The paper describes MM-Vet as an evaluation benchmark consisting of 218 samples and does not specify training, validation, or test splits for its own experimental setup; it uses the entire dataset for evaluation. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU models, CPU types) used for running the experiments or evaluations. |
| Software Dependencies | No | The paper mentions software components like GPT-4, LMMs, and external tools (e.g., Azure API, Bing search, PAL math tool) but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | As shown in Table 1, for each sample, we fill the prompt template with its question, ground truth, and output from a specific LMM. By feeding the filled prompt into GPT-4, GPT-4 generates a score from 0 to 1 for the sample. GPT-4's outputs still exhibit variance even when the temperature is set to 0. Therefore, we use GPT-4 to evaluate each LMM output 5 times. |
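The quoted scoring procedure (fill a prompt template, query a GPT-4 judge, and average 5 runs to absorb judge variance) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the template text, the `query_gpt4` function, and all names are assumptions, with the judge call stubbed out so the averaging logic runs offline.

```python
import statistics

# Simplified placeholder template; the paper uses a longer few-shot prompt (its Table 1).
PROMPT_TEMPLATE = (
    "Compare the ground truth and the prediction, then give a correctness "
    "score between 0.0 and 1.0.\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Prediction: {prediction}\n"
    "Score:"
)


def query_gpt4(prompt: str, temperature: float = 0.0) -> float:
    """Hypothetical stand-in for a GPT-4 API call returning a score in [0, 1].

    A real implementation would send `prompt` to the model and parse the
    numeric score from its reply; here we return a fixed value so the
    surrounding logic is runnable without network access.
    """
    return 0.8


def score_sample(question: str, ground_truth: str, prediction: str,
                 n_runs: int = 5) -> float:
    """Fill the template, query the judge n_runs times, and average the scores,
    since judge outputs vary across runs even at temperature 0."""
    prompt = PROMPT_TEMPLATE.format(
        question=question, ground_truth=ground_truth, prediction=prediction
    )
    scores = [query_gpt4(prompt) for _ in range(n_runs)]
    return statistics.mean(scores)


avg_score = score_sample("What animal is shown?", "a cat", "a small cat")
```

A benchmark-level score would then be the mean of `score_sample` over all 218 samples (and over the per-capability subsets).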