MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Authors: Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
Researcher Affiliation | Academia | 1. Huazhong University of Science and Technology; 2. Zhejiang University of Technology; 3. LAIR Lab, Lehigh University.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/.
Open Datasets | Yes | The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/. We meticulously curate a dataset consisting of 4,414 image-text pairs, gathered from a variety of downstream task datasets, as detailed in Table 8 in Appendix B.
Dataset Splits | No | The paper describes partitioning its collected data into D_score, D_pair, and D_batch for the three evaluation tasks, but does not specify traditional training, validation, and test splits with percentages or counts for its own experimental setup.
Hardware Specification | Yes | We collect responses by inference on a dual-4090 local server.
Software Dependencies | No | The paper names specific MLLMs (e.g., GPT-4V, LLaVA-1.5-13b) and their inference parameters, but does not provide version numbers for underlying software dependencies such as programming languages or deep learning frameworks.
Experiment Setup | Yes | We set the temperature and top-p as 0.9, max-token as 2048 (for GPT-4V as judge). For LLaVA-1.5-13b... we set temperature as 0, top-p as 1, max-token as 2048, and beam search number as 3.
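
The sketch below shows one way the quoted generation settings could be wired up for the two judge models. It is not the authors' released code: the OpenAI model identifier, the Hugging Face "llava-hf/llava-1.5-13b-hf" checkpoint, and the prompt/image handling are all assumptions made for illustration.

# Minimal sketch of judge inference with the generation settings quoted above.
# Assumed: OpenAI Python SDK (>=1.0) for GPT-4V and the Hugging Face
# "llava-hf/llava-1.5-13b-hf" checkpoint for LLaVA-1.5-13b; prompt and image
# handling are illustrative only.
from openai import OpenAI
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


def judge_with_gpt4v(image_url: str, judge_prompt: str) -> str:
    """GPT-4V as judge: temperature 0.9, top-p 0.9, max 2048 tokens."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": judge_prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        temperature=0.9,
        top_p=0.9,
        max_tokens=2048,
    )
    return response.choices[0].message.content


def judge_with_llava(image_path: str, judge_prompt: str) -> str:
    """LLaVA-1.5-13b as judge: temperature 0 (deterministic), top-p 1,
    3 beams, max 2048 new tokens."""
    model_id = "llava-hf/llava-1.5-13b-hf"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
    prompt = f"USER: <image>\n{judge_prompt} ASSISTANT:"
    inputs = processor(text=prompt, images=Image.open(image_path), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=False,      # temperature 0 => deterministic decoding
        num_beams=3,          # beam search number as 3
        max_new_tokens=2048,
    )
    return processor.decode(output_ids[0], skip_special_tokens=True)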