MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Authors: Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks... We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and model designs.
Researcher Affiliation | Collaboration | National University of Singapore, Singapore; Microsoft Azure AI, USA.
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code and data are available at https://github.com/yuweihao/MM-Vet, and the online evaluator at https://huggingface.co/spaces/whyu/MM-Vet_Evaluator.
Open Datasets | Yes | In addition to the 187 images, ten extra images with high-quality questions are collected from VCR (Zellers et al., 2019), with the questions and answers modified to an open-ended answering format. Another three images are from Chest X-ray14 (Wang et al., 2017) to obtain corresponding medical expert knowledge.
Dataset Splits | No | The paper describes MM-Vet as an evaluation benchmark consisting of 218 samples and does not specify training, validation, or test splits for its own experimental setup; it uses the entire dataset for evaluation.
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU models, CPU types) used for running the experiments or evaluations.
Software Dependencies | No | The paper mentions software components like GPT-4, LMMs, and external tools (e.g., Azure API, Bing search, PAL math tool) but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | As shown in Table 1, for each sample, we fill the prompt template with its question, ground truth, and output from a specific LMM. By feeding the filled prompt to GPT-4, GPT-4 generates a score from 0 to 1 for the sample. It is found that GPT-4's outputs still exhibit variance even when the temperature is set to 0. Therefore, we use GPT-4 to evaluate the outputs of LMMs 5 times.
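A minimal sketch of the scoring loop described in the Experiment Setup row above, assuming the current openai Python client. The prompt text, variable names, and parsing here are hypothetical stand-ins, not the paper's method: the actual few-shot template appears in Table 1 of the paper and ships with https://github.com/yuweihao/MM-Vet.

```python
# Sketch of the MM-Vet GPT-4 scoring protocol: fill a template with the
# question, ground truth, and LMM output; ask GPT-4 for a score in [0, 1];
# repeat 5 times and average, since the paper observes residual variance
# in GPT-4 outputs even at temperature 0.
from statistics import mean

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical stand-in for the paper's few-shot scoring template (Table 1).
PROMPT_TEMPLATE = (
    "Compare the ground truth and the prediction from an AI model, and give "
    "a correctness score between 0.0 and 1.0. Reply with the score only.\n"
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Prediction: {prediction}\n"
    "Correctness score:"
)


def score_sample(question: str, ground_truth: str, prediction: str,
                 n_runs: int = 5) -> float:
    """Score one LMM output by querying GPT-4 n_runs times and averaging."""
    prompt = PROMPT_TEMPLATE.format(
        question=question, ground_truth=ground_truth, prediction=prediction
    )
    scores = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # variance remains across runs despite this
            messages=[{"role": "user", "content": prompt}],
        )
        # A robust implementation would parse defensively; this sketch
        # assumes the model replies with a bare number as instructed.
        scores.append(float(response.choices[0].message.content.strip()))
    return mean(scores)
```

Averaging over the 5 runs is what smooths out the nondeterminism the paper reports; a per-sample score is then aggregated across the 218 samples to produce the benchmark totals.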