Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

HelpSteer2-Preference: Complementing Ratings with Preferences

Authors: Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. We perform evaluation using Reward Bench (Lambert et al., 2024), a trusted reward modeling benchmark with over 140 models on the public leaderboard. Table 1: Performance of Models on Reward Bench.
Researcher Affiliation	Collaboration	Zhilin Wang1 Alexander Bukharin1,2 Olivier Delalleau1 Daniel Egert1 Gerald Shen1 Jiaqi Zeng1 Oleksii Kuchaiev1 Yi Dong1 EMAIL 1NVIDIA, 2Georgia Tech, work done during internship at NVIDIA
Pseudocode	No	The paper describes methods using mathematical equations for loss functions and textual explanations, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code	Yes	Reward Model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward Instruct Model: huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct
Open Datasets	Yes	Dataset (CC-BY-4.0-License): huggingface.co/datasets/nvidia/HelpSteer2
Dataset Splits	Yes	Overall, we have 7,118 preference pairs with 6,766 pairs in the training set and 352 pairs in the validation set.
Hardware Specification	Yes	Experiments are run on nodes of 8 A100/H100-80GB SXM GPUs on internal clusters.
Software Dependencies	No	The paper mentions using NLTK for sentence tokenization and Scikit-Learn for kappa score calculation, and GPT-4-Turbo for evaluation, but does not provide specific version numbers for these software libraries or the framework used for model implementation.
Experiment Setup	Yes	Appendix E: TRAINING HYPER-PARAMETERS provides details on epochs, global batch sizes, learning rates, optimizers (AdamW), warm-up steps, and KL penalties for Reward Modelling, Direct Preference Optimization, Proximal Policy Optimization, and REINFORCE.
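
The split sizes quoted in the Dataset Splits row can be checked with simple arithmetic. The sketch below is illustrative only and is not part of the paper or the index pipeline; the constant names and the `split_summary` helper are hypothetical, and the three counts are copied from the quoted sentence.

```python
# Counts copied from the Dataset Splits row above.
TRAIN_PAIRS = 6766
VALIDATION_PAIRS = 352
REPORTED_TOTAL = 7118


def split_summary(train: int, validation: int) -> tuple[int, float]:
    """Return the combined pair count and the validation share of that total."""
    total = train + validation
    return total, validation / total


total, val_fraction = split_summary(TRAIN_PAIRS, VALIDATION_PAIRS)

# The two splits should add up to the reported number of preference pairs.
print(f"train + validation = {total} (reported: {REPORTED_TOTAL})")
print(f"validation share = {val_fraction:.1%}")
```

The two splits sum to the reported 7,118 pairs, with the validation set making up just under 5% of the total.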