Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Authors: Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, Tuo Zhao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF. |
| Researcher Affiliation | Collaboration | Zhenghao Xu¹, Qin Lu², Qingru Zhang¹, Liang Qiu², Ilgee Hong¹, Changlong Yu², Wenlin Yao², Yao Liu², Haoming Jiang², Lihong Li², Hyokun Yun², Tuo Zhao¹; ¹Georgia Institute of Technology, ²Amazon |
| Pseudocode | No | The paper only describes methods using mathematical formulations and descriptive text, without any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, or structured code-like procedures. |
| Open Source Code | Yes | Our code is available at https://github.com/zhenghaoxu-gatech/uncertainty-router. |
| Open Datasets | Yes | For PM training, we use the HelpSteer2-Preference dataset [51]... For downstream alignment, we use a subset (the first 33%) of the prompt from the UltraFeedback dataset [8]... For RM evaluation, we use RewardBench [25] and RM-Bench [31] datasets. |
| Dataset Splits | Yes | For PM training, we use the HelpSteer2-Preference dataset [51], which consists of 7,118 high-quality preference pairs with 6,766 training data pairs and 352 validation data pairs. |
| Hardware Specification | Yes | We run evaluations on 4 NVIDIA-A100 GPUs in parallel, each processing 25% of the comparisons with an SNGP-PM. |
| Software Dependencies | No | Our implementation of SNGP follows [37], particularly the one applied to BERT. Specifically, we only apply spectral normalization to the linear layer in the last decoder and set the spectral normalization range to 1... For training, we follow the code from OpenRLHF, which is an easy-to-use, high-performance open-source RLHF framework [20]. Hyperparameters are summarized in Table 7. |
| Experiment Setup | Yes | Table 7: Training configurations for PM and SNGP-PM. Base model name: Llama-3.1-8B-Instruct; Batch size: 256; Micro batch size: 16; Training epochs: 2 (3 if counting the covariance calculation pass); Quantization: BFloat16; Learning rate (LR): {2e-6, 3e-6, 4e-6, 5e-6, 6e-6}; Learning rate scheduler: Cosine with min LR (0.1 base LR); Warm up ratio: 0.03; Gradient accumulation steps: 16; Max input length: 8192; DeepSpeed ZeRO stage: 2; Flash attention: Enabled |
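
The Research Type entry above refers to an uncertainty-based routing strategy that calls a strong LLM judge only on selected comparisons. As a rough illustration only, the sketch below routes the most uncertain comparisons to a judge under a fixed call budget. It is not taken from the authors' repository: the function name `route_by_uncertainty`, the `judge` callback, the `budget` parameter, and the top-k selection rule are all hypothetical, and the uncertainty scores are treated as given inputs rather than computed by an SNGP-PM.

```python
from typing import Callable, List, Sequence


def route_by_uncertainty(
    rm_prefs: Sequence[float],
    uncertainties: Sequence[float],
    judge: Callable[[int], float],
    budget: int,
) -> List[float]:
    """Replace the `budget` most uncertain reward-model outputs with judge outputs.

    All names and the top-k rule are illustrative assumptions, not the paper's API.
    rm_prefs[i]      : reward-model preference probability for comparison i
    uncertainties[i] : uncertainty score for comparison i (higher = less certain)
    judge(i)         : hypothetical callback returning the judge's preference for i
    """
    if len(rm_prefs) != len(uncertainties):
        raise ValueError("rm_prefs and uncertainties must have the same length")
    if budget < 0:
        raise ValueError("budget must be non-negative")

    # Rank comparison indices from most to least uncertain.
    order = sorted(range(len(rm_prefs)), key=lambda i: uncertainties[i], reverse=True)
    routed = set(order[:budget])

    # Keep the reward-model output unless the comparison was routed to the judge.
    return [judge(i) if i in routed else rm_prefs[i] for i in range(len(rm_prefs))]


# Toy usage with made-up numbers: the two most uncertain comparisons
# (indices 3 and 1) are sent to a stub judge that always returns 1.0.
example = route_by_uncertainty(
    rm_prefs=[0.9, 0.55, 0.2, 0.48],
    uncertainties=[0.1, 0.8, 0.3, 0.9],
    judge=lambda i: 1.0,
    budget=2,
)
```

A random-calling baseline at the same cost would pick `budget` indices uniformly at random instead of taking the top of the uncertainty ranking, which is the comparison the Research Type entry describes.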