Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Generative RLHF-V: Learning Principles from Multi-modal Human Preference

Authors: Jiayi Zhou, Jiaming Ji, Boyuan Chen, Jiapeng Sun, Wenqi Chen, Donghai Hong, Sirui Han, Yike Guo, Yaodong Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs performance across 7 benchmarks by 18.1%, while the baseline RLHF is only 5.3%. Our experiments were conducted on servers equipped with 8 * Nvidia H800 GPUs. We utilized verl for RL training and align-anything for reward modeling and supervised fine-tuning.
Researcher Affiliation Academia (1) Institute for Artificial Intelligence, Peking University; (2) State Key Laboratory of General Artificial Intelligence, Peking University; (3) Hong Kong University of Science and Technology; (4) University College London
Pseudocode Yes The core code for our implementation is presented as follows: import re from mathruler.grader import extract_boxed_content, grade_answer def acc_reward(predict_str: str, ground_truth: str) -> float: if "\\boxed" not in predict_str: return 0.0 ... The implementation of grouped comparison within the Reinforcement Learning (RL) optimization process is somewhat intricate, as detailed below: def compute_score(data_sources: list[str], solution_strs: list[str], ground_truths: list[float], extra_infos: list[dict] = None) -> float: # Check for complete responses and assign 0 score to incomplete ones complete_responses = [is_complete_response(solution) for solution in solution_strs] ...
Open Source Code Yes Our code and models can be found at https://generative-rlhf-v.github.io.
Open Datasets Yes For the helpfulness, we utilized a 30k preference dataset from Align-Anything [45], the text-image-to-text part. ... For the harmlessness, we employed Beavertails-V [46] which includes 20 distinct categories of safety-related red-teaming prompts.
Dataset Splits No The paper mentions 'For the helpfulness, we utilized a 30k preference dataset from Align-Anything [45], the text-image-to-text part.' and 'For the harmlessness, we employed Beavertails-V [46] which includes 20 distinct categories of safety-related red-teaming prompts.' However, it does not specify the training, validation, or test splits for these datasets, nor does it refer to standard splits with citations.
Hardware Specification Yes Our experiments were conducted on servers equipped with 8 * Nvidia H800 GPUs.
Software Dependencies No We utilized verl for RL training and align-anything for reward modeling and supervised fine-tuning. ... our implementation is primarily based on verl, a training framework that supports Reinforcement Learning (RL) optimization for Multimodal Large Language Models (MLLMs).
Experiment Setup Yes Table 5: Hyperparameters of generative reward modeling from RL and RL optimization (two values per row, one per setting). Training Epochs: 2 / 2; Train Batch Size: 360 / 360; RL Mini Batch Size: 128 / 128; Max Prompt Length: 12800 / 4096; Actor Learning Rate: 1E-6 / 1E-6 ...
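
The Pseudocode row quotes two fragments: an accuracy reward that returns 0.0 when the prediction contains no \boxed answer, and a scoring function that first assigns a score of 0 to incomplete responses. The sketch below illustrates only those two visible steps and is not the authors' code. The helper extract_boxed and the exact string match are stand-ins for mathruler's extract_boxed_content and grade_answer; the completeness rule in is_complete_response and the function name compute_scores are hypothetical, since the excerpt elides the real logic; and the grouped comparison itself is not reproduced here.

```python
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Assumption: a simple stand-in for mathruler's extract_boxed_content.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def acc_reward(predict_str: str, ground_truth: str) -> float:
    """1.0 if the boxed answer matches the ground truth, else 0.0."""
    answer = extract_boxed(predict_str)
    if answer is None:
        return 0.0
    # Assumption: exact match after stripping, in place of grade_answer.
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def is_complete_response(solution: str) -> bool:
    """Hypothetical rule: complete means a balanced boxed answer exists."""
    return extract_boxed(solution) is not None


def compute_scores(solution_strs: List[str], ground_truths: List[str]) -> List[float]:
    """Score each response; incomplete responses get 0.0."""
    complete = [is_complete_response(s) for s in solution_strs]
    return [
        acc_reward(s, g) if ok else 0.0
        for s, g, ok in zip(solution_strs, ground_truths, complete)
    ]
```

For example, compute_scores(["\\boxed{1}", "", "\\boxed{3}"], ["1", "2", "4"]) scores the first response as correct, the empty one as incomplete, and the third as a wrong answer.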