Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

Authors: Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, Yaodong Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we seek to address the following questions: Q1: How does Safe RLHF-V compare to existing multimodal alignment methods? Q2: How robust are reward and cost models in Safe RLHF-V with respect to preference data? Q3: Lagrange multipliers regulate the trade-off in the min-max optimization. Does the algorithm perform well for different values of multipliers? 5.1 Experiment Setup Datasets and Models. To ensure the safety of MLLMs, we use BeaverTails-V as the training dataset. Specifically, we utilize the helpfulness and safety preference to train the RM-V and CM-V, respectively. Our experiments are conducted on various MLLMs: Llava-7B-(1.5 and 1.6) and Qwen2-VL-7B. Furthermore, we fine-tune the original model using RLHF [7] and DPO [8] with single-dimension annotations and compare them against RLHF-V [13], Llava-RLHF [14], and MM-RLHF [55] as the comparison baseline. Evaluation Metrics. Given the absence of publicly available multimodal evaluation datasets that simultaneously assess helpfulness and safety, we constructed our own evaluation set of BeaverTails-V. We employ the win rate metric to quantify improvements in model performance. Model outputs are evaluated using GPT-4o, and the specific prompts utilized for evaluation are detailed in Appendix C. 5.2 Main Results We use win rate as the key metric, and the pairwise comparisons of model outputs (assessed via GPT-4o) indicate overall improvement. As shown in Table 4, Safe RLHF-V delivers the best performance in optimizing helpfulness while ensuring compliance with safety constraints. 5.3 Ablation Study How much preference data do RM-V and CM-V require? RM-V and CM-V serve as essential optimization signals within the Safe RLHF-V pipeline. To assess model accuracy across different data scales, we randomly selected varying proportions of preference data to train RM-V and CM-V separately. As shown in Table 5, our results indicate that the multimodal reward model achieves 82.2% accuracy with just 5K preference data points. 5.4 The Training Curve and Budget Bound Analysis As shown in Fig. 3, we illustrate the evolution of λ during training and its correlation with reward and cost, offering clearer insights into the constrained min-max optimization principle of Safe RLHF-V.
Researcher Affiliation Academia Jiaming Ji1,2, Xinyu Chen1, Rui Pan1, Han Zhu3, Jiahao Li1, Donghai Hong1, Boyuan Chen1, Jiayi Zhou1,2, Kaile Wang1, Juntao Dai1, Chi-Min Chan3, Sirui Han3, Yike Guo3, and Yaodong Yang1,* 1Institute for Artificial Intelligence, Peking University 2State Key Laboratory of General Artificial Intelligence, Peking University 3Hong Kong University of Science and Technology
Pseudocode No The paper describes algorithms and methods using mathematical equations (e.g., equations 1-15) and textual explanations, but no distinct, structured pseudocode blocks or algorithm listings are present.
Open Source Code Yes All of datasets, models, and code can be found at https://github.com/SafeRLHF-V. All datasets, models, and code have been open-sourced. We hope this work can facilitate the safety alignment of MLLMs, thereby mitigating their potential societal risks.
Open Datasets Yes (Dataset) We have open-sourced BeaverTails-V, the first dataset featuring dual preference annotations for both helpfulness and safety in MLLMs. For each pair, we independently annotated preferences regarding helpfulness and safety. Additionally, we provided graded safety labels: minor, moderate, and severe to reduce the inconsistencies in labeling, establishing the first exploration to enable multi-level safety alignment in MLLMs.
Dataset Splits No The paper states, "To ensure the safety of MLLMs, we use BeaverTails-V as the training dataset" and "we constructed our own evaluation set of BeaverTails-V." It also mentions, "We randomly sampled 2,000 preference annotation instances from outputs generated by a variety of models" for agreement evaluation, and "We randomly selected varying proportions of preference data to train RM-V and CM-V separately." However, it does not provide specific percentages or counts for training, testing, or validation splits for the main experiments, nor does it refer to standard predefined splits for BeaverTails-V with explicit information.
Hardware Specification No The paper mentions using specific models like "Llava-7B-(1.5 and 1.6)" and "Qwen2-VL-7B" and training on these, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or memory) used for these experiments.
Software Dependencies No The paper refers to algorithms like Proximal Policy Optimization (PPO) and models like GPT-4o, Ovis1.6-Gemma2-9B, Phi-3.5-vision-instruct, and paraphrase-MiniLM-L6-v2. However, it does not explicitly list any software dependencies (e.g., programming languages, libraries, or frameworks) with their specific version numbers that would be required to reproduce the experiments.
Experiment Setup Yes A Implementation Details
A.1 Preference Models We initialize our reward model (RM-V) and cost model (CM-V) using the pre-trained model, Llava1.5-7B [3]. During the training phase, we employ the loss functions presented in equations (4) and (5). Additionally, we incorporate an extra regularization term within the loss function to enhance generalization and stabilize the training process.
A.2 Details of RLHF Training Following the training paradigm proposed by [7], we use reinforcement learning from human feedback (RLHF) to optimize our model. The training objective consists of two key components: the RL objective and the PTX pretraining objective. The RL objective is guided by a reward model-vision (RM-V), with an additional per-token KL penalty to constrain policy updates and ensure stable learning. During RL training, given a prompt $x \sim \mathcal{D}_{\text{prompt}}$, the current policy model $\pi_\theta(y|x)$ generates a response sequence $y = a_{1:T}$, where $T$ represents the response length. To stabilize training, we utilize a reference model $\pi_{\text{ref}}(\cdot|x)$, which is used to compute the KL divergence and regularize the reward signal. For RLHF fine-tuning, we adopt the Proximal Policy Optimization (PPO) algorithm [35], employing a clipped surrogate loss formulation:
$$\mathcal{L}^{\text{RL}}(\theta; \mathcal{D}_{\text{prompt}}) = -\mathbb{E}_{x \sim \mathcal{D}_{\text{prompt}},\, y \sim \pi_\theta(y|x)} \left[ \mathbb{E}_t \left[ \min\left( \rho_t(\theta)\, \hat{A}^{\hat{r}_t},\ \text{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}^{\hat{r}_t} \right) \right] \right]$$
where $\rho_t(\theta) = \pi_\theta(a_t \mid a_{1:t-1}, x) / \pi_{\theta_{\text{old}}}(a_t \mid a_{1:t-1}, x)$ is the importance sampling ratio, $\theta_{\text{old}}$ represents the model parameters from the previous update, and $\epsilon \in (0, 1)$ is the PPO clipping coefficient. The advantage estimate $\hat{A}^{\hat{r}_t}$ is computed using Generalized Advantage Estimation (GAE) [56]. In addition to the RL objective, we incorporate a PTX objective to preserve model knowledge and stability. Since pretraining data is inaccessible, we utilize a Supervised Fine-Tuning (SFT) dataset to compute the PTX loss, ensuring that the model's performance on generation tasks remains unaffected by RL optimization. We utilize the Align-Anything-TI2T-Instruction-100K Dataset [57] for PTX optimization. The total training loss during the RLHF phase is defined as follows:
$$\mathcal{L}^{\text{RLHF}}(\theta; \mathcal{D}_{\text{prompt}}, \mathcal{D}_{\text{SFT}}) = \mathcal{L}^{\text{RL}}(\theta; \mathcal{D}_{\text{prompt}}) + \gamma \cdot \mathcal{L}^{\text{PTX}}(\theta; \mathcal{D}_{\text{SFT}}), \quad (10)$$
where $\gamma$ represents the PTX loss coefficient.
A.3 Details of Safe RLHF-V Training Similar to the Safe RLHF training process proposed by [18], Safe RLHF-V iteratively solves the minimax problem in equation (7) by alternately updating the model parameters $\theta$ and the Lagrange multipliers $\lambda$. We incorporate the KL reward into both the reward $\hat{r}_t$ and the cost $\hat{c}_t$, and normalize these two loss terms with a factor of $(1 + \lambda)$:
$$\mathcal{L}^{\text{SafeRL}}_R(\theta; \mathcal{D}_{\text{prompt}}) = -\mathbb{E}_{x \sim \mathcal{D}_{\text{prompt}},\, y \sim \pi_\theta(y|x)} \left[ \mathbb{E}_t \left[ \min\left( \rho_t(\theta)\, \hat{A}^{\hat{r}_t},\ \text{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}^{\hat{r}_t} \right) \right] \right], \quad (11)$$
$$\mathcal{L}^{\text{SafeRL}}_C(\theta; \mathcal{D}_{\text{prompt}}) = -\mathbb{E}_{x \sim \mathcal{D}_{\text{prompt}},\, y \sim \pi_\theta(y|x)} \left[ \mathbb{E}_t \left[ \min\left( \rho_t(\theta)\, \hat{A}^{\hat{c}_t},\ \text{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}^{\hat{c}_t} \right) \right] \right], \quad (12)$$
$$\mathcal{L}^{\text{SafeRL}}(\theta; \mathcal{D}_{\text{prompt}}) = \frac{1}{1+\lambda} \left[ \mathcal{L}^{\text{SafeRL}}_R(\theta; \mathcal{D}_{\text{prompt}}) - \lambda \cdot \mathcal{L}^{\text{SafeRL}}_C(\theta; \mathcal{D}_{\text{prompt}}) \right], \quad (13)$$
where $\hat{A}^{\hat{r}_t}$ and $\hat{A}^{\hat{c}_t}$ are the advantage values of the reward and cost, respectively, estimated using the GAE method. The update rules for the model parameters $\theta$ and the Lagrange multipliers $\lambda$ are derived as:
$$\theta_{k+1} = \theta_k - \frac{\eta}{1+\lambda_k} \nabla_{\theta_k} \left[ \mathcal{L}^{\text{SafeRL}}_R(\theta_k) - \lambda_k \cdot \mathcal{L}^{\text{SafeRL}}_C(\theta_k) \right] - \eta\gamma \nabla_{\theta_k} \mathcal{L}^{\text{PTX}}(\theta_k), \quad (14)$$
$$\ln \lambda_{k+1} = \ln \lambda_k + \alpha \cdot \lambda_k \cdot \mathcal{J}_C(\theta_k), \quad (15)$$
where $\eta$ and $\alpha$ represent the learning rates, and $\mathcal{L}^{\text{PTX}}$ and $\gamma$ are the PTX loss and its coefficient, respectively, as defined in equation (10). During the Safe RLHF-V training process, we maintain a moving average of the cost model's output to estimate the value of $\mathcal{J}_C(\theta_k)$.
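
The quantities in equations (11) to (15) can be illustrated with a small, framework-free Python sketch. It is not the authors' implementation: the function and class names are hypothetical, the inputs are plain lists of per-token ratios and advantages instead of model outputs, and the exponential form of the moving average (with its decay of 0.9) is an assumption, since the quoted text only says "a moving average". The leading minus sign on the surrogate loss follows the usual convention that the loss is minimized by gradient descent.

```python
import math


def clipped_surrogate_loss(ratios, advantages, eps=0.2):
    """Negative mean over tokens of min(rho * A, clip(rho, 1-eps, 1+eps) * A).

    The same form is used for the reward loss (11) and the cost loss (12);
    only the advantages differ.
    """
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        terms.append(min(rho * adv, clipped * adv))
    return -sum(terms) / len(terms)


def safe_rl_loss(loss_r, loss_c, lam):
    """Equation (13): (L_R - lam * L_C) / (1 + lam)."""
    return (loss_r - lam * loss_c) / (1.0 + lam)


def update_log_lambda(log_lam, cost_estimate, alpha):
    """Equation (15): ln(lam) <- ln(lam) + alpha * lam * J_C."""
    lam = math.exp(log_lam)
    return log_lam + alpha * lam * cost_estimate


class MovingAverage:
    """Exponential moving average (assumed form) used to estimate J_C."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * x
        return self.value


if __name__ == "__main__":
    # Toy loop: feed made-up cost-model outputs and track the multiplier.
    log_lam = 0.0
    cost_avg = MovingAverage(decay=0.9)
    for batch_cost in [0.5, 0.2, -0.1]:
        j_c = cost_avg.update(batch_cost)
        log_lam = update_log_lambda(log_lam, j_c, alpha=0.05)
        print(f"J_C estimate={j_c:.3f}  lambda={math.exp(log_lam):.4f}")
```

Updating ln λ rather than λ keeps the multiplier positive without an explicit projection step. Under equation (15), a positive cost estimate raises λ and shifts weight toward the cost term in equation (13), while a negative estimate lowers it.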