Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Authors: Yifan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Tingting Gao, Zhang Zhang, Fan Yang, Di Zhang, Liang Wang, Rong Jin

ICML 2025 | Venue PDF | LLM Run Details

Each entry below lists a reproducibility variable, its classified result, and the supporting LLM response.
Research Type: Experimental
LLM response: Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure 1). We conduct extensive evaluations across ten key dimensions, covering 27 benchmarks. The results demonstrate that our training algorithm, combined with the high-quality MM-RLHF dataset, leads to significant improvements in model performance.
Researcher Affiliation: Collaboration
LLM response: (1) Institute of Automation, School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; (2) Nanjing University (NJU), Nanjing, China; (3) Kuaishou, Beijing, China.
Pseudocode: No
LLM response: The paper describes the Critique-Based Reward Model and Dynamic Reward Scaling in prose and mathematical formulations, but does not provide structured pseudocode or algorithm blocks.
Open Source Code: No
LLM response: The paper refers to using state-of-the-art models from both open-source and closed-source domains, and mentions that 'MM-RLHF-Reward-7B achieves SOTA performance on several benchmarks among open-source models'. However, it does not contain an explicit statement or link confirming the release of the code for the methodology described in the paper.
Open Datasets: No
LLM response: At first, we introduce MM-RLHF, a dataset designed to advance Multimodal Reinforcement Learning from Human Feedback (RLHF). The dataset spans three key domains: image understanding, video understanding, and MLLM safety. To evaluate the effectiveness of the signals provided by our reward model in guiding subsequent model training, we randomly sample 10 examples from each category of the MM-RLHF dataset to create a test set.
Dataset Splits: Yes
LLM response: To evaluate the effectiveness of the signals provided by our reward model in guiding subsequent model training, we randomly sample 10 examples from each category of the MM-RLHF dataset to create a test set. For ablation studies, we uniformly sample 1/5 of the data, which may result in minor performance discrepancies compared to the full dataset.
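The split procedure quoted above (10 test examples per category, plus a uniform 1/5 subsample for ablations) can be sketched as follows. This is an illustrative reconstruction, not code from the paper; the `category` field and the `build_splits` helper are assumptions.

```python
import random
from collections import defaultdict

def build_splits(examples, test_per_category=10, ablation_fraction=0.2, seed=0):
    """Sample a fixed number of examples per category for the test set,
    then draw a uniform fraction of the remaining pool for ablation runs."""
    rng = random.Random(seed)

    # Group examples by their category label (field name is illustrative).
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)

    test, train = [], []
    for cat, items in by_cat.items():
        rng.shuffle(items)
        test.extend(items[:test_per_category])   # 10 per category -> test set
        train.extend(items[test_per_category:])  # remainder -> training pool

    # Uniform 1/5 subsample of the training pool, as described for ablations.
    ablation = rng.sample(train, k=max(1, int(len(train) * ablation_fraction)))
    return train, test, ablation
```

With three categories of 50 examples each, this yields a 30-example test set, a 120-example training pool, and a 24-example ablation subset.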
Hardware Specification: Yes
LLM response: All experiments are conducted on a high-performance computing cluster equipped with 32 H800 (80G) GPUs.
Software Dependencies: No
LLM response: The paper describes the implementation details of MM-DPO and the parameters for the SFT loss and learning rate, but does not specify software dependencies with version numbers (e.g., Python or PyTorch versions, or other libraries).
Experiment Setup: Yes
LLM response: In the implementation of MM-DPO, we adopt a common stabilization technique by incorporating an SFT loss. The weight of the SFT loss is selected through a grid search over the values {0, 0.1, 0.25, 0.5, 1.0}. Additionally, the learning rate is optimized via a search over {1e-7, 5e-7, 1e-6, 5e-6, 1e-5} to identify the best-performing configuration. Since we dynamically adjust the β parameter during training, the initial value β_ori is set to a small default of 0.1, eliminating the need for manual tuning. Throughout all training processes, the vision encoder remains frozen to ensure stable and efficient training.
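The hyperparameter search quoted above can be sketched as a plain grid search over the stated SFT-loss weights and learning rates. This is a minimal illustration under assumptions: `train_and_eval` is a hypothetical stand-in for the actual MM-DPO training and evaluation loop, which the paper does not release.

```python
from itertools import product

# Search spaces quoted in the paper's experiment setup.
SFT_WEIGHTS = [0.0, 0.1, 0.25, 0.5, 1.0]
LEARNING_RATES = [1e-7, 5e-7, 1e-6, 5e-6, 1e-5]
BETA_ORI = 0.1  # initial beta; adjusted dynamically during training per the paper

def grid_search(train_and_eval):
    """Return (best_score, best_config) over the SFT-weight x LR grid.
    `train_and_eval` is a hypothetical callable that trains a model with the
    given configuration and returns a scalar evaluation score."""
    best = None
    for sft_w, lr in product(SFT_WEIGHTS, LEARNING_RATES):
        score = train_and_eval(sft_weight=sft_w, lr=lr, beta_ori=BETA_ORI)
        if best is None or score > best[0]:
            best = (score, {"sft_weight": sft_w, "lr": lr})
    return best
```

In the paper's setting each grid point is a full training run; the sketch only shows the 25-point search structure, not the training itself.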