Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Pre-Trained Policy Discriminators are General Reward Models

Authors: Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Haijun Lv, Demin Song, Songyang Gao, Chengqi Lyu, Enyu Zhou, Honglin Guo, Zhiheng Xi, Qipeng Guo, Wenwei Zhang, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Kai Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Finetuning (RFT), providing reliable reward signals and markedly enhancing policy performance: improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
Researcher Affiliation Collaboration ¹Shanghai AI Laboratory, ²Fudan University
Pseudocode No The paper describes the methods using narrative text and mathematical equations in Section 3 but does not include any explicit pseudocode blocks or algorithms.
Open Source Code Yes https://github.com/InternLM/POLAR. We have submitted our source code in the Supplementary Material. We will upload our code and data to GitHub upon acceptance.
Open Datasets Yes The prompts are primarily sourced from widely used open-source preference-pair datasets such as UltraFeedback [19] and HH-RLHF [5; 28], with a small subset derived from real user queries submitted to online chat platforms. Our primary evaluation uses the RMB benchmark [130], containing 3,162 questions, each with multiple trajectories ranked by preference scores. Table 8: Benchmarks we used in policy evaluation.
Dataset Splits Yes Our primary evaluation uses the RMB benchmark [130], containing 3,162 questions, each with multiple trajectories ranked by preference scores. The top-ranked trajectories are treated as references, representing samples drawn from a target policy. The task is to identify whether RMs correctly prefer the second-ranked trajectory over the third-ranked one. Additionally, we create another evaluation set from real user queries collected through online platforms and manually annotate trajectory rankings (see Appendix E.2). We carefully remove overlaps with the training data to maintain independence. For supervised fine-tuning data, 'each prompt is associated with three outputs ranked from best to worst. The top two outputs constitute a positive pair, while the second and third-ranked outputs form a negative pair.'
Hardware Specification Yes The pre-training process is conducted on 320 NVIDIA H800 GPUs for a total duration of 57 hours. For pre-training POLAR-7B, we set N = 7B, Dp = 4.0T, and Drm = 3.6T. Then we get the learning rate 1.67e-5 and the batch size 4343. The pre-training process is conducted on 912 NVIDIA H800 GPUs for a total duration of 175 hours. Each round of supervised fine-tuning runs on 16 NVIDIA H800 GPUs for approximately 0.5 hours. PPO experiments for all policy models, except Qwen2.5-32B-Instruct, are conducted using 32 NVIDIA H800 GPUs per run, each taking approximately 48 hours. For Qwen2.5-32B-Instruct, we utilize 64 NVIDIA H800 GPUs per run, with each run lasting roughly 72 hours.
Software Dependencies No We adopt the XTuner framework for pre-training and fine-tuning. Policy optimization employs the Proximal Policy Optimization (PPO) algorithm [96]. PPO algorithm implemented in OpenRLHF [43]. The paper mentions software frameworks like XTuner, PPO, and OpenRLHF, but does not provide specific version numbers for any of them.
Experiment Setup Yes During the pre-training stage, ...we carried out scaling experiments designed to establish data-driven scaling laws... The results of these scaling experiments are illustrated in Figures 5 and 6... For POLAR-1.8B... learning rate to be 1.4e-5 and the batch size as 1940... For pre-training POLAR-7B...learning rate 1.67e-5 and the batch size 4343. For supervised fine-tuning of POLAR models, we set the learning rate to 1e-5 for the 1.8B model and 2e-5 for the 7B model, use a batch size of 320, and train for one epoch. For RLHF experiments, ...we set the actor learning rate to 1e-6, the critic learning rate to 1e-5, the training batch size to 1024, the rollout batch size to 1024, and the number of epochs to 1.
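
The GPU counts and durations quoted in the Hardware Specification row can be turned into rough per-run GPU-hour totals. The sketch below is only an arithmetic illustration: the run labels and helper names are hypothetical, the 320-GPU run is not tied to a specific model in the quoted text, and the fine-tuning and PPO durations are described as approximate, so the resulting totals are estimates.

```python
# Rough GPU-hour arithmetic from the figures quoted in the Hardware
# Specification row (all runs use NVIDIA H800 GPUs).
# Run labels below are illustrative names, not identifiers from the paper.

def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """Return GPU-hours for one run: GPUs multiplied by wall-clock hours."""
    return num_gpus * wall_clock_hours


# (number of GPUs, wall-clock hours per run) as quoted in the row.
RUNS = {
    "pretrain_320_gpu": (320, 57.0),    # model not named in the quoted sentence
    "pretrain_polar_7b": (912, 175.0),
    "sft_round": (16, 0.5),             # "approximately 0.5 hours" per round
    "ppo_default": (32, 48.0),          # "approximately 48 hours" per run
    "ppo_qwen2.5_32b": (64, 72.0),      # "roughly 72 hours" per run
}


def summarize(runs: dict) -> dict:
    """Map each run label to its estimated GPU-hours."""
    return {name: gpu_hours(n, h) for name, (n, h) in runs.items()}


if __name__ == "__main__":
    for name, total in summarize(RUNS).items():
        print(f"{name}: {total:,.0f} GPU-hours")
```

These are per-run figures only; the quoted text does not state how many fine-tuning rounds or PPO runs were performed, so no overall total is computed here.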