Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
HelpSteer2-Preference: Complementing Ratings with Preferences
Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. We perform evaluation using Reward Bench (Lambert et al., 2024), a trusted reward modeling benchmark with over 140 models on the public leaderboard. Table 1: Performance of Models on Reward Bench. |
| Researcher Affiliation | Collaboration | Zhilin Wang¹, Alexander Bukharin¹,², Olivier Delalleau¹, Daniel Egert¹, Gerald Shen¹, Jiaqi Zeng¹, Oleksii Kuchaiev¹, Yi Dong¹. EMAIL. ¹NVIDIA; ²Georgia Tech, work done during internship at NVIDIA |
| Pseudocode | No | The paper describes methods using mathematical equations for loss functions and textual explanations, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Reward Model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward; Instruct Model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct |
| Open Datasets | Yes | Dataset (CC-BY-4.0-License): huggingface.co/datasets/nvidia/HelpSteer2 |
| Dataset Splits | Yes | Overall, we have 7,118 preference pairs with 6,766 pairs in the training set and 352 pairs in the validation set. |
| Hardware Specification | Yes | Experiments are run on nodes of 8 A100/H100-80GB SXM GPUs on internal clusters. |
| Software Dependencies | No | The paper mentions using NLTK for sentence tokenization and Scikit-Learn for kappa score calculation, and GPT-4-Turbo for evaluation, but does not provide specific version numbers for these software libraries or the framework used for model implementation. |
| Experiment Setup | Yes | Appendix E: TRAINING HYPER-PARAMETERS provides details on epochs, global batch sizes, learning rates, optimizers (AdamW), warm-up steps, and KL penalties for Reward Modelling, Direct Preference Optimization, Proximal Policy Optimization, and REINFORCE. |
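
The Research Type row refers to a head-to-head comparison of Bradley-Terry and Regression reward models. As a rough illustration of how the two objectives differ, the sketch below writes out the textbook form of each loss for a single example. It is not taken from the paper: the function names, the scalar inputs, and the choice of squared error for the regression objective are assumptions made for this example, and the paper's own loss definitions and hyper-parameters (Appendix E) take precedence.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration; only the reward difference matters.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def regression_loss(predicted_score: float, annotated_rating: float) -> float:
    """Pointwise loss: squared error against a human rating (assumed form)."""
    return (predicted_score - annotated_rating) ** 2


# Bradley-Terry needs a (chosen, rejected) pair; regression needs one rated response.
pair_loss = bradley_terry_loss(reward_chosen=1.5, reward_rejected=0.5)
point_loss = regression_loss(predicted_score=3.0, annotated_rating=4.0)
```

The practical difference is in the supervision each loss consumes: the pairwise loss uses preference pairs such as those counted in the Dataset Splits row, while the pointwise loss uses per-response ratings.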