MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Authors: Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. |
| Researcher Affiliation | Academia | ¹Huazhong University of Science and Technology, ²Zhejiang University of Technology, ³LAIR Lab, Lehigh University. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/. |
| Open Datasets | Yes | The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/. We meticulously curate a dataset consisting of 4,414 image-text pairs, gathered from a variety of downstream task datasets, as detailed in Table 8 in Appendix B. |
| Dataset Splits | No | The paper describes partitioning its collected data into D_score, D_pair, and D_batch for task evaluations, but does not specify traditional training, validation, and test splits with percentages or counts for its own experimental setup. |
| Hardware Specification | Yes | We collect responses by inference on a dual-4090 local server. |
| Software Dependencies | No | The paper mentions using specific MLLMs (e.g., GPT-4V, LLaVA-1.5-13b) and their parameters, but does not provide specific version numbers for underlying software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | We set the temperature and top-p as 0.9, max-token as 2048 (for GPT-4V as judge). For LLaVA-1.5-13b... We set temperature as 0, top-p as 1, max-token as 2048, and beam search number as 3. |
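
The decoding settings quoted in the Experiment Setup row can be recorded as a small configuration table, which may help when trying to reproduce the runs. The following is a minimal sketch, not the authors' code: the names `DecodingConfig`, `JUDGE_CONFIGS`, and `settings_for` are hypothetical, and attributing the second group of settings to LLaVA-1.5-13b is an assumption, since the quoted passage is elided at that point.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding parameters as quoted in the Experiment Setup row."""
    temperature: float
    top_p: float
    max_tokens: int
    num_beams: Optional[int] = None  # only stated for the second group


# Hypothetical keys; values are copied from the quoted text.
JUDGE_CONFIGS = {
    # "temperature and top-p as 0.9, max-token as 2048" (GPT-4V as judge)
    "gpt-4v": DecodingConfig(temperature=0.9, top_p=0.9, max_tokens=2048),
    # "temperature as 0, top-p as 1, max-token as 2048, beam search number as 3"
    # Assumed to apply to LLaVA-1.5-13b; the source elides the connecting text.
    "llava-1.5-13b": DecodingConfig(
        temperature=0.0, top_p=1.0, max_tokens=2048, num_beams=3
    ),
}


def settings_for(model: str) -> dict:
    """Return the stated settings for a model, omitting unstated fields."""
    config = asdict(JUDGE_CONFIGS[model])
    return {key: value for key, value in config.items() if value is not None}


if __name__ == "__main__":
    for name in JUDGE_CONFIGS:
        print(name, settings_for(name))
```

How these values map onto a particular inference API (for example, what the beam count parameter is called) depends on the library used, which the paper does not specify.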