Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Provably Efficient Online RLHF with One-Pass Reward Modeling
Authors: Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we design practical algorithms for LLMs and conduct experiments with the Llama-3-8B-Instruct and Qwen2.5-7B-Instruct models on Ultrafeedback and Mixture2 datasets, validating the effectiveness of our approach. |
| Researcher Affiliation | Academia | Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China EMAIL |
| Pseudocode | Yes | The detailed process of our proposed method is presented in Algorithm 1. Algorithm 1 One-Pass Reward Modeling |
| Open Source Code | Yes | The code is available at https://github.com/ZinYY/Online_RLHF |
| Open Datasets | Yes | With the above techniques, we conduct experiments using the LLaMA-3-8B-Instruct [Llama Team, 2023] and Qwen2.5-7B-Instruct [Qwen Team, 2024] models on the Ultrafeedback [Cui et al., 2024] and Mixture2 [Dong et al., 2024] datasets. |
| Dataset Splits | Yes | Passive data collection: We randomly choose 30,000 samples from the UltraFeedback-binarized dataset's train_prefs split for training. Each sample consists of a prompt and two responses with a label indicating the preferred response. We use the test_prefs split for evaluation. Active data collection: We allow the method to actively select 6,400 samples from the train_prefs split according to different selection strategies. The global batch size is set to 8 for training. The selection is performed iteratively, where in each iteration, the method selects the most informative samples based on its selection criterion. Deployment-time adaption: We use a pre-processed online variant of the UltraFeedback-binarized dataset from the test_gen split. The dataset is divided into 20 sequential chunks to simulate an online deployment scenario. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions the use of the Adam optimizer but does not provide specific version numbers for key software components like programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | In our experiments, we set K = 3 and λ0 = 0.8 and choose the linear function f(t/T) = t/T as the damping function. ... The global batch size is set to 8 for training. ... We initialize the policy with 400 samples and use the same dataset settings as PPO to iteratively update the policy model using the DPO algorithm. |
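
The split protocol quoted under Dataset Splits can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the function names `sample_passive_subset` and `sequential_chunks`, the seed value, and the near-equal chunk sizing are hypothetical, since the quoted text states only that 30,000 samples are chosen at random and that the deployment data is divided into 20 sequential chunks.

```python
import random


def sample_passive_subset(n_total, n_samples=30_000, seed=0):
    """Pick n_samples distinct row indices uniformly at random.

    The seed is an assumption; the quoted text does not state one.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_samples))


def sequential_chunks(items, n_chunks=20):
    """Split items into n_chunks contiguous, order-preserving chunks.

    Chunk sizes differ by at most one; this sizing rule is an assumption.
    """
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

In use, `n_total` would be the number of rows in the train_prefs split, and `items` would be the rows of the test_gen split in their original order, so that processing the chunks one after another mimics a stream of deployment-time data.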