Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Authors: Yifan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Tingting Gao, Zhang Zhang, Fan Yang, Di Zhang, Liang Wang, Rong Jin
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure 1). We conduct extensive evaluations across ten key dimensions, covering 27 benchmarks. The results demonstrate that our training algorithm, combined with the high-quality MM-RLHF dataset, leads to significant improvements in model performance. |
| Researcher Affiliation | Collaboration | (1) Institute of Automation, School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China; (2) Nanjing University (NJU), Nanjing, China; (3) Kuaishou, Beijing, China. |
| Pseudocode | No | The paper describes the Critique-Based Reward Model and Dynamic Reward Scaling in prose and mathematical formulations, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to using state-of-the-art models from both open-source and closed-source domains, and mentions 'MM-RLHF-Reward-7B achieves SOTA performance on several benchmarks among open-source models'. However, it does not contain an explicit statement or link confirming the release of the code for the methodology described in this paper. |
| Open Datasets | No | First, we introduce MM-RLHF, a dataset designed to advance Multimodal Reinforcement Learning from Human Feedback (RLHF). The dataset spans three key domains: image understanding, video understanding, and MLLM safety. To evaluate the effectiveness of the signals provided by our reward model in guiding subsequent model training, we randomly sample 10 examples from each category of the MM-RLHF dataset to create a test set. |
| Dataset Splits | Yes | To evaluate the effectiveness of the signals provided by our reward model in guiding subsequent model training, we randomly sample 10 examples from each category of the MM-RLHF dataset to create a test set. For ablation studies, we uniformly sample 1/5 of the data, which may result in minor performance discrepancies compared to the full dataset. |
| Hardware Specification | Yes | All experiments are conducted on a high-performance computing cluster equipped with 32 H800 (80G) GPUs. |
| Software Dependencies | No | The paper describes the implementation details of MM-DPO and the parameters for SFT loss and learning rate, but does not specify software dependencies with version numbers (e.g., Python, PyTorch versions or other libraries). |
| Experiment Setup | Yes | In the implementation of MM-DPO, we adopt a common stabilization technique by incorporating an SFT loss. The weight of the SFT loss is selected through a grid search over the values {0, 0.1, 0.25, 0.5, 1.0}. Additionally, the learning rate is optimized via a search over {1e-7, 5e-7, 1e-6, 5e-6, 1e-5} to identify the best-performing configuration. Since we dynamically adjust the β parameter during training, the initial value of β_ori is set to a small default value of 0.1, eliminating the need for manual tuning. Throughout all training processes, the vision encoder remains frozen to ensure stable and efficient training. |
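The hyperparameter search in the Experiment Setup row can be sketched as follows. The loss form, function name, and argument layout are illustrative assumptions: the excerpt specifies only the two search grids, the initial β of 0.1, and that an SFT loss is added to a DPO-style objective. The paper's dynamic β scaling is not reproduced here, so β is held fixed.

```python
import itertools
import math

# Search grids quoted in the Experiment Setup row.
SFT_WEIGHTS = [0, 0.1, 0.25, 0.5, 1.0]
LEARNING_RATES = [1e-7, 5e-7, 1e-6, 5e-6, 1e-5]

def dpo_with_sft_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                      sft_logp, beta=0.1, sft_weight=0.5):
    """Per-example DPO preference loss plus a weighted SFT (NLL) term.

    The *_chosen / *_rejected arguments are summed log-probabilities of a
    response under the policy or the frozen reference model; sft_logp is
    the log-probability of the chosen response, used for the auxiliary
    SFT loss. This is a generic DPO+SFT form, not the exact MM-DPO loss.
    """
    # Log-ratio margin between the chosen and rejected responses.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log sigmoid(beta * margin): the standard DPO preference term.
    dpo = -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
    # Auxiliary SFT loss: negative log-likelihood of the chosen response.
    return dpo + sft_weight * (-sft_logp)

# The search enumerates all 25 (sft_weight, learning_rate) configurations.
grid = list(itertools.product(SFT_WEIGHTS, LEARNING_RATES))
```

Per the quoted setup, β would start at 0.1 and be adjusted dynamically during training, and the vision encoder would stay frozen throughout; both details are outside the scope of this sketch.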
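The split procedure quoted in the Dataset Splits row can be sketched as follows. The grouping key `category` and the data layout are assumptions; the excerpt gives only the counts (10 examples per category for the test set, a uniform 1/5 sample for ablations).

```python
import random

def make_splits(dataset, seed=0, test_per_category=10, ablation_frac=0.2):
    """Sample a per-category test set and a uniform ablation subset,
    mirroring the procedure quoted above. 'category' is an assumed
    field name for the MM-RLHF category label."""
    rng = random.Random(seed)
    # Group examples by category.
    by_category = {}
    for ex in dataset:
        by_category.setdefault(ex["category"], []).append(ex)
    # Test set: a fixed number of examples drawn from each category.
    test_set = []
    for examples in by_category.values():
        test_set.extend(rng.sample(examples, min(test_per_category, len(examples))))
    # Ablation subset: a uniform 1/5 sample of the full dataset.
    ablation_set = rng.sample(dataset, int(len(dataset) * ablation_frac))
    return test_set, ablation_set
```

As the quoted text notes, training ablations on the 1/5 subsample may show minor performance discrepancies relative to the full dataset.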