Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Provably Efficient Online RLHF with One-Pass Reward Modeling

Authors: Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Finally, we design practical algorithms for LLMs and conduct experiments with the Llama-3-8B-Instruct and Qwen2.5-7B-Instruct models on Ultrafeedback and Mixture2 datasets, validating the effectiveness of our approach. |
| Researcher Affiliation | Academia | Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou. National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China. |
| Pseudocode | Yes | The detailed process of our proposed method is presented in Algorithm 1. Algorithm 1: One-Pass Reward Modeling |
| Open Source Code | Yes | The code is available at https://github.com/ZinYY/Online_RLHF |
| Open Datasets | Yes | With the above techniques, we conduct experiments using the LLaMA-3-8B-Instruct [Llama Team, 2023] and Qwen2.5-7B-Instruct [Qwen Team, 2024] models on the Ultrafeedback [Cui et al., 2024] and Mixture2 [Dong et al., 2024] datasets. |
| Dataset Splits | Yes | Passive data collection: We randomly choose 30,000 samples from the UltraFeedback-binarized dataset's train_prefs split for training. Each sample consists of a prompt and two responses with a label indicating the preferred response. We use the test_prefs split for evaluation. Active data collection: We allow the method to actively select 6,400 samples from the train_prefs split according to different selection strategies. The global batch size is set to 8 for training. The selection is performed iteratively, where in each iteration, the method selects the most informative samples based on its selection criterion. Deployment-time adaption: We use a pre-processed online variant of the UltraFeedback-binarized dataset from the test_gen split. The dataset is divided into 20 sequential chunks to simulate an online deployment scenario. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions the use of the Adam optimizer but does not provide specific version numbers for key software components like programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | In our experiments, we set K = 3 and λ0 = 0.8 and choose the linear function f(t/T) = t/T as the damping function. ... The global batch size is set to 8 for training. ... We initialize the policy with 400 samples and use the same dataset settings as PPO to iteratively update the policy model using the DPO algorithm. |
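
The Dataset Splits and Experiment Setup entries above quote several concrete settings. The following is a minimal Python sketch that collects those reported values and illustrates the linear damping function f(t/T) = t/T and the division of a dataset into 20 sequential chunks. It is not the authors' code: the names `ReportedSetup`, `linear_damping`, and `sequential_chunks` are hypothetical, the near-equal chunk sizes are an assumption (the report only says the data is divided into 20 sequential chunks), and the sketch deliberately does not show how K, λ0, and the damping function interact, since the quoted text does not state it.

```python
from dataclasses import dataclass
from typing import List, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the report; field names are illustrative only."""

    K: int = 3                           # "we set K = 3"
    lambda_0: float = 0.8                # "λ0 = 0.8"
    global_batch_size: int = 8           # "global batch size is set to 8"
    passive_train_samples: int = 30_000  # sampled from train_prefs
    active_selected_samples: int = 6_400  # actively selected from train_prefs
    deployment_chunks: int = 20          # sequential chunks from test_gen
    dpo_init_samples: int = 400          # samples used to initialize the policy


def linear_damping(t: int, total: int) -> float:
    """Linear damping function f(t/T) = t/T for step t out of T total steps."""
    if total <= 0:
        raise ValueError("total must be positive")
    return t / total


def sequential_chunks(items: Sequence[T], n_chunks: int) -> List[List[T]]:
    """Split items into n_chunks contiguous chunks, preserving order.

    Assumption: chunk sizes differ by at most one; the report does not
    say how the chunk boundaries were chosen.
    """
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    chunks: List[List[T]] = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks
```

For example, `sequential_chunks(list(range(45)), ReportedSetup().deployment_chunks)` yields 20 ordered chunks that together cover all 45 items, which mirrors the simulated online deployment stream described in the Dataset Splits entry.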