MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Authors: Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks... We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and model designs.
Researcher Affiliation | Collaboration | National University of Singapore, Singapore; Microsoft Azure AI, USA.
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code and data are available at https://github.com/yuweihao/MM-Vet, and the online evaluator at https://huggingface.co/spaces/whyu/MM-Vet_Evaluator.
Open Datasets | Yes | In addition to the 187 images, ten extra images with high-quality questions are collected from VCR (Zellers et al., 2019), with the questions and answers modified to an open-ended answering format. Another three images are from Chest X-ray14 (Wang et al., 2017) to obtain corresponding medical expert knowledge.
Dataset Splits | No | The paper describes MM-Vet as an evaluation benchmark consisting of 218 samples and does not specify training, validation, or test splits for its own experimental setup; it uses the entire dataset for evaluation.
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU models, CPU types) used for running the experiments or evaluations.
Software Dependencies | No | The paper mentions software components like GPT-4, LMMs, and external tools (e.g., Azure API, Bing search, PAL math tool) but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | As shown in Table 1, for each sample, we fill the prompt template with its question, ground truth, and output from a specific LMM. By feeding the filled prompt to GPT-4, GPT-4 generates a score from 0 to 1 for the sample. It is found that GPT-4's outputs still exhibit variance even when the temperature is set to 0. Therefore, we use GPT-4 to evaluate the outputs of LMMs 5 times.
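A minimal sketch of the scoring loop described in the Experiment Setup row above, assuming the current openai Python client. The prompt text, variable names, and parsing here are hypothetical stand-ins, not the paper's method: the actual few-shot template appears in Table 1 of the paper and ships with https://github.com/yuweihao/MM-Vet.

```python
# Sketch of the MM-Vet GPT-4 scoring protocol: fill a template with the
# question, ground truth, and LMM output; ask GPT-4 for a score in [0, 1];
# repeat 5 times and average, since the paper observes residual variance
# in GPT-4 outputs even at temperature 0.
from statistics import mean

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical stand-in for the paper's few-shot scoring template (Table 1).
PROMPT_TEMPLATE = (
    "Compare the ground truth and the prediction from an AI model, and give "
    "a correctness score between 0.0 and 1.0. Reply with the score only.\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Prediction: {prediction}\n"
    "Correctness score:"
)


def score_sample(question: str, ground_truth: str, prediction: str,
                 n_runs: int = 5) -> float:
    """Score one LMM output by querying GPT-4 n_runs times and averaging."""
    prompt = PROMPT_TEMPLATE.format(
        question=question, ground_truth=ground_truth, prediction=prediction
    )
    scores = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # variance remains across runs despite this
            messages=[{"role": "user", "content": prompt}],
        )
        # A robust implementation would parse defensively; this sketch
        # assumes the model replies with a bare number as instructed.
        scores.append(float(response.choices[0].message.content.strip()))
    return mean(scores)
```

Averaging over the 5 runs is what smooths out the nondeterminism the paper reports; a per-sample score is then aggregated across the 218 samples to produce the benchmark totals.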