MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Authors: Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
Researcher Affiliation | Academia | 1. Huazhong University of Science and Technology; 2. Zhejiang University of Technology; 3. LAIR Lab, Lehigh University.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/.
Open Datasets | Yes | The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/. We meticulously curate a dataset consisting of 4,414 image-text pairs, gathered from a variety of downstream task datasets, as detailed in Table 8 in Appendix B.
Dataset Splits | No | The paper describes partitioning its collected data into D_score, D_pair, and D_batch for the three evaluation tasks, but does not specify traditional training, validation, and test splits with percentages or counts for its own experimental setup.
Hardware Specification | Yes | We collect responses by inference on a dual-4090 local server.
Software Dependencies | No | The paper names specific MLLMs (e.g., GPT-4V, LLaVA-1.5-13b) and their inference parameters, but does not provide version numbers for underlying software dependencies such as programming languages or deep learning frameworks.
Experiment Setup | Yes | We set the temperature and top-p as 0.9, max-token as 2048 (for GPT-4V as judge). For LLaVA-1.5-13b... we set temperature as 0, top-p as 1, max-token as 2048, and beam search number as 3.
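
The sketch below shows one way the quoted generation settings could be wired up for the two judge models. It is not the authors' released code: the OpenAI model identifier, the Hugging Face "llava-hf/llava-1.5-13b-hf" checkpoint, and the prompt/image handling are all assumptions made for illustration.

# Minimal sketch of judge inference with the generation settings quoted above.
# Assumed: OpenAI Python SDK (>=1.0) for GPT-4V and the Hugging Face
# "llava-hf/llava-1.5-13b-hf" checkpoint for LLaVA-1.5-13b; prompt and image
# handling are illustrative only.
from openai import OpenAI
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


def judge_with_gpt4v(image_url: str, judge_prompt: str) -> str:
    """GPT-4V as judge: temperature 0.9, top-p 0.9, max 2048 tokens."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": judge_prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        temperature=0.9,
        top_p=0.9,
        max_tokens=2048,
    )
    return response.choices[0].message.content


def judge_with_llava(image_path: str, judge_prompt: str) -> str:
    """LLaVA-1.5-13b as judge: temperature 0 (deterministic), top-p 1,
    3 beams, max 2048 new tokens."""
    model_id = "llava-hf/llava-1.5-13b-hf"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
    prompt = f"USER: <image>\n{judge_prompt} ASSISTANT:"
    inputs = processor(text=prompt, images=Image.open(image_path), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=False,      # temperature 0 => deterministic decoding
        num_beams=3,          # beam search number as 3
        max_new_tokens=2048,
    )
    return processor.decode(output_ids[0], skip_special_tokens=True)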