Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
Authors: Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that Think-RM outperforms both BT RM and vertically scaled GenRM on both in-distribution (ID) and OOD tasks, with particularly strong gains on reasoning-heavy benchmarks: more than 10% and 5% on RewardBench's Chat Hard and Reasoning, and 12% on RM-Bench's Math domain. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches. |
| Researcher Affiliation | Collaboration | Ilgee Hong¹, Changlong Yu², Liang Qiu², Weixiang Yan², Zhenghao Xu¹, Haoming Jiang², Qingru Zhang¹, Qin Lu², Xin Liu², Chao Zhang², Tuo Zhao². ¹Georgia Institute of Technology, ²Amazon |
| Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks. Figure 2 provides a high-level overview diagram, and sections like 3.2 and 3.3 describe steps in paragraph form, not code-like formatting. |
| Open Source Code | Yes | The code, datasets, and models are publicly available at https://github.com/IlgeeHong/Think-RM. |
| Open Datasets | Yes | The code, datasets, and models are publicly available at https://github.com/IlgeeHong/Think-RM. Training Data and Baselines. We use HelpSteer2-Preference [46] as training data for all baseline methods and Think-RM. |
| Dataset Splits | No | After removing tie samples and excluding the test split, we obtain 6,766 training samples. ... This results in 6K training samples for binary preference and 4K for multiclass preference. We use HelpSteer3-Preference [49] as a benchmark to evaluate generalization under moderate distribution shift. Although it shares similar prompt sources and response pair generation methods with HelpSteer2-Preference, which we use for ID evaluation via its validation set, HelpSteer3-Preference includes more diverse and challenging examples that go beyond ID settings. |
| Hardware Specification | Yes | We train Think-RM using eight A100 GPUs (1 node), each with 80GB of memory. For pairwise RLHF training, we use sixteen A100 GPUs (2 nodes), each with 80GB of memory: one node is allocated for RL training and the other for Gen RM inference. |
| Software Dependencies | No | We use OpenRLHF [51] to train BT RM and all SFT models, and VeRL [52] for all RL experiments (Think-RM's rule-based RL stage and pairwise RLHF with GenRMs). For warm-up SFT, we use the Adam optimizer [53] with β1 = 0.9 and β2 = 0.95, which are the default settings in OpenRLHF [51]. For rule-based RL, we use the AdamW optimizer [54] with β1 = 0.9 and β2 = 0.999, following the default settings in VeRL [52]. |
| Experiment Setup | Yes | For warm-up SFT, we fine-tune Llama-3.1-8B-Instruct for 5 epochs with a learning rate of 1e-5 for binary outputs and 5e-6 for multiclass outputs. We fine-tune Qwen2.5-3B-Instruct for 10 epochs with a learning rate of 1e-5 for binary outputs. At the rule-based RL stage, we use a rollout batch size of 512, a KL coefficient of β = 1e-4, and a group size of G = 8 for both binary and multiclass settings. The learning rate is 2e-6 for Llama-3.1-8B-Instruct and 1e-6 for Qwen2.5-3B-Instruct. For RLHF experiments, we use the same hyperparameters as in rule-based RL, except we reduce the group size to G = 4. Additional implementation details are provided in Appendix C. |
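
The hyperparameters quoted in the Experiment Setup and Software Dependencies rows can be gathered into a single structure for anyone attempting a reproduction. The following is a minimal sketch, not the authors' configuration format: the class, field, and function names are hypothetical, and only the numeric values come from the quoted text. Settings not quoted above (for example, those deferred to Appendix C) are deliberately left out.

```python
from dataclasses import dataclass, replace

# Warm-up SFT settings as quoted above, keyed by (model, output type).
# The dictionary layout and key names are illustrative assumptions.
WARMUP_SFT = {
    ("Llama-3.1-8B-Instruct", "binary"): {"epochs": 5, "learning_rate": 1e-5},
    ("Llama-3.1-8B-Instruct", "multiclass"): {"epochs": 5, "learning_rate": 5e-6},
    ("Qwen2.5-3B-Instruct", "binary"): {"epochs": 10, "learning_rate": 1e-5},
}
WARMUP_SFT_ADAM_BETAS = (0.9, 0.95)  # Adam betas quoted for warm-up SFT

# Learning rates quoted for the rule-based RL stage, per base model.
RL_LEARNING_RATE = {
    "Llama-3.1-8B-Instruct": 2e-6,
    "Qwen2.5-3B-Instruct": 1e-6,
}


@dataclass(frozen=True)
class RLConfig:
    """Rule-based RL hyperparameters; field names are hypothetical."""
    learning_rate: float
    rollout_batch_size: int = 512
    kl_coef: float = 1e-4              # KL coefficient beta
    group_size: int = 8                # G
    adamw_betas: tuple = (0.9, 0.999)  # AdamW betas quoted for rule-based RL


def rule_based_rl_config(model: str) -> RLConfig:
    """Settings quoted for Think-RM's rule-based RL stage."""
    return RLConfig(learning_rate=RL_LEARNING_RATE[model])


def rlhf_config(model: str) -> RLConfig:
    """Pairwise RLHF reuses the rule-based RL settings except for G = 4."""
    return replace(rule_based_rl_config(model), group_size=4)


if __name__ == "__main__":
    for name in RL_LEARNING_RATE:
        print(name, rule_based_rl_config(name), rlhf_config(name))
```

Keeping the RLHF settings as a derived copy of the rule-based RL settings mirrors the quoted statement that the two stages differ only in group size, so any other difference found in the released code would stand out.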