Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Authors: Yifan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Tingting Gao, Zhang Zhang, Fan Yang, Di Zhang, Liang Wang, Rong Jin

ICML 2025 | Venue PDF | LLM Run Details

Each entry below lists a reproducibility variable, its classified result, and the supporting LLM response.
Research Type: Experimental
LLM response: Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure 1). We conduct extensive evaluations across ten key dimensions, covering 27 benchmarks. The results demonstrate that our training algorithm, combined with the high-quality MM-RLHF dataset, leads to significant improvements in model performance.
Researcher Affiliation: Collaboration
LLM response: (1) Institute of Automation, School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; (2) Nanjing University (NJU), Nanjing, China; (3) Kuaishou, Beijing, China.
Pseudocode: No
LLM response: The paper describes the Critique-Based Reward Model and Dynamic Reward Scaling in prose and mathematical formulations, but does not provide structured pseudocode or algorithm blocks.
Open Source Code: No
LLM response: The paper refers to using state-of-the-art models from both open-source and closed-source domains, and mentions that 'MM-RLHF-Reward-7B achieves SOTA performance on several benchmarks among open-source models'. However, it does not contain an explicit statement or link confirming the release of the code for the methodology described in the paper.
Open Datasets: No
LLM response: At first, we introduce MM-RLHF, a dataset designed to advance Multimodal Reinforcement Learning from Human Feedback (RLHF). The dataset spans three key domains: image understanding, video understanding, and MLLM safety. To evaluate the effectiveness of the signals provided by our reward model in guiding subsequent model training, we randomly sample 10 examples from each category of the MM-RLHF dataset to create a test set.
Dataset Splits: Yes
LLM response: To evaluate the effectiveness of the signals provided by our reward model in guiding subsequent model training, we randomly sample 10 examples from each category of the MM-RLHF dataset to create a test set. For ablation studies, we uniformly sample 1/5 of the data, which may result in minor performance discrepancies compared to the full dataset.
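The split procedure quoted above (10 test examples per category, plus a uniform 1/5 subsample for ablations) can be sketched as follows. This is an illustrative reconstruction, not code from the paper; the `category` field and the `build_splits` helper are assumptions.

```python
import random
from collections import defaultdict

def build_splits(examples, test_per_category=10, ablation_fraction=0.2, seed=0):
    """Sample a fixed number of examples per category for the test set,
    then draw a uniform fraction of the remaining pool for ablation runs."""
    rng = random.Random(seed)

    # Group examples by their category label (field name is illustrative).
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)

    test, train = [], []
    for cat, items in by_cat.items():
        rng.shuffle(items)
        test.extend(items[:test_per_category])   # 10 per category -> test set
        train.extend(items[test_per_category:])  # remainder -> training pool

    # Uniform 1/5 subsample of the training pool, as described for ablations.
    ablation = rng.sample(train, k=max(1, int(len(train) * ablation_fraction)))
    return train, test, ablation
```

With three categories of 50 examples each, this yields a 30-example test set, a 120-example training pool, and a 24-example ablation subset.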
Hardware Specification: Yes
LLM response: All experiments are conducted on a high-performance computing cluster equipped with 32 H800 (80G) GPUs.
Software Dependencies: No
LLM response: The paper describes the implementation details of MM-DPO and the parameters for the SFT loss and learning rate, but does not specify software dependencies with version numbers (e.g., Python or PyTorch versions, or other libraries).
Experiment Setup: Yes
LLM response: In the implementation of MM-DPO, we adopt a common stabilization technique by incorporating an SFT loss. The weight of the SFT loss is selected through a grid search over the values {0, 0.1, 0.25, 0.5, 1.0}. Additionally, the learning rate is optimized via a search over {1e-7, 5e-7, 1e-6, 5e-6, 1e-5} to identify the best-performing configuration. Since we dynamically adjust the β parameter during training, the initial value β_ori is set to a small default of 0.1, eliminating the need for manual tuning. Throughout all training processes, the vision encoder remains frozen to ensure stable and efficient training.
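The hyperparameter search quoted above can be sketched as a plain grid search over the stated SFT-loss weights and learning rates. This is a minimal illustration under assumptions: `train_and_eval` is a hypothetical stand-in for the actual MM-DPO training and evaluation loop, which the paper does not release.

```python
from itertools import product

# Search spaces quoted in the paper's experiment setup.
SFT_WEIGHTS = [0.0, 0.1, 0.25, 0.5, 1.0]
LEARNING_RATES = [1e-7, 5e-7, 1e-6, 5e-6, 1e-5]
BETA_ORI = 0.1  # initial beta; adjusted dynamically during training per the paper

def grid_search(train_and_eval):
    """Return (best_score, best_config) over the SFT-weight x LR grid.
    `train_and_eval` is a hypothetical callable that trains a model with the
    given configuration and returns a scalar evaluation score."""
    best = None
    for sft_w, lr in product(SFT_WEIGHTS, LEARNING_RATES):
        score = train_and_eval(sft_weight=sft_w, lr=lr, beta_ori=BETA_ORI)
        if best is None or score > best[0]:
            best = (score, {"sft_weight": sft_w, "lr": lr})
    return best
```

In the paper's setting each grid point is a full training run; the sketch only shows the 25-point search structure, not the training itself.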