Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Greedy Sampling Is Provably Efficient For RLHF

Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The theoretical results (i.e., the efficiency of directly using empirical estimates) are corroborated with simulation results. In particular, under both general and BT preference models, experimental results demonstrate that the simple greedy sampling approach achieves statistically comparable performance to prior methods with more sophisticated policy constructions. Experiments are conducted to corroborate the theoretical findings.
Researcher Affiliation Academia Di Wu Electrical and Computer Engineering University of Virginia Charlottesville, VA 22903 EMAIL Chengshuai Shi Princeton Language and Intelligence Princeton University Princeton, NJ 08540 EMAIL Jing Yang Electrical and Computer Engineering University of Virginia Charlottesville, VA 22903 EMAIL Cong Shen Electrical and Computer Engineering University of Virginia Charlottesville, VA 22903 EMAIL
Pseudocode Yes Algorithm 1 Online RLHF with Greedy Sampling ... Algorithm 2 Offline RLHF with Greedy Sampling
Open Source Code Yes The experiment codes have been uploaded in the supplementary materials, which will be made publicly accessible upon acceptance.
Open Datasets No For both the general preference model and the BT model, we consider the linear setting with randomly sampled context vectors and 6 fixed actions. In particular, the general preference model is considered to be a linear one with dimension k k k while the BT model with dimension k k, where k is set to 5. Detailed experimental setups and implementation details are deferred to Appendix E. For implementation, we choose k = 5 and first uniformly randomly sample from [0, 1] to construct the ground-truth preference model parameters M and W. We similarly sample 6 vectors from [0, 1]^5 as the action set A. In each iteration, we randomly sample a vector from the uniform distribution in [0, 1]^5 as the context vector, and then we sample action pairs (a1, a2) based on the policies.
Dataset Splits No The paper does not explicitly provide training/test/validation dataset splits. It describes generating data online by sampling context vectors and action pairs, and using a pre-collected offline dataset D_0, but does not detail any splits for D_0. Algorithm 2 Offline RLHF with Greedy Sampling: Input: parameter η, reference policy π_0, pre-collected data D_0 = {(x_i, a_i^1, a_i^2, y_i)}_{i=1}^m.
Hardware Specification No The reported experiments are light in computation, which are performed with a mainstream laptop and can be executed in minutes.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers.
Experiment Setup Yes For implementation, we choose k = 5 and first uniformly randomly sample from [0, 1] to construct the ground-truth preference model parameters M and W. We similarly sample 6 vectors from [0, 1]^5 as the action set A. In each iteration, we randomly sample a vector from the uniform distribution in [0, 1]^5 as the context vector, and then we sample action pairs (a1, a2) based on the policies. We run the trajectory for T iterations and repeat the experiments 5 times, computing the averages and standard deviations. In both settings, the regularization coefficient η is set to 1. To simulate different degrees of optimism, we use 1 and 3 as the number of responses selected in the tournament. To consider different levels of optimism, β = 0.3 and 0.5 are used.
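The synthetic setup quoted above can be approximated with a short sketch of the data-generation loop for the BT case. This is an illustration under stated assumptions, not the authors' code: the reward form r(x, a) = Σ_j w_j x_j a_j, the uniform action-pair sampler (standing in for the learned policies), and the names `bt_preference_prob`, `simulate`, and `repeat_runs` are all hypothetical. Only k = 5, the 6 actions, sampling from [0, 1], and the 5 repetitions come from the quoted text.

```python
import math
import random
import statistics

K = 5            # dimension k, as stated in the quoted setup
NUM_ACTIONS = 6  # size of the fixed action set A


def sample_unit_cube(rng, dim):
    """Draw one vector uniformly from [0, 1)^dim."""
    return [rng.random() for _ in range(dim)]


def bt_preference_prob(w, x, a1, a2):
    """P(a1 preferred over a2) under a Bradley-Terry model.

    ASSUMPTION: the reward is r(x, a) = sum_j w_j * x_j * a_j. The source
    only says the model is linear; this feature map is a placeholder.
    """
    r1 = sum(wj * xj * aj for wj, xj, aj in zip(w, x, a1))
    r2 = sum(wj * xj * aj for wj, xj, aj in zip(w, x, a2))
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))


def simulate(T, seed):
    """Generate T preference samples (x, a1_index, a2_index, y)."""
    rng = random.Random(seed)
    w = sample_unit_cube(rng, K)  # ground-truth BT parameter
    actions = [sample_unit_cube(rng, K) for _ in range(NUM_ACTIONS)]
    data = []
    for _ in range(T):
        x = sample_unit_cube(rng, K)  # fresh context each iteration
        # ASSUMPTION: a uniform pair sampler stands in for the policies.
        i1, i2 = rng.sample(range(NUM_ACTIONS), 2)
        p = bt_preference_prob(w, x, actions[i1], actions[i2])
        y = 1 if rng.random() < p else 0  # y = 1 means a1 was preferred
        data.append((x, i1, i2, y))
    return data


def repeat_runs(num_runs=5, T=200):
    """Repeat the simulation and report mean and std of a toy statistic.

    The statistic (fraction of rounds where a1 wins) is illustrative only;
    the paper reports performance of the learned policies instead.
    """
    rates = [sum(y for *_, y in simulate(T, seed)) / T
             for seed in range(num_runs)]
    return statistics.mean(rates), statistics.stdev(rates)


if __name__ == "__main__":
    mean_rate, std_rate = repeat_runs()
    print(f"mean={mean_rate:.3f} std={std_rate:.3f}")
```

Replacing the uniform pair sampler with a policy derived from the current parameter estimate would bring the loop closer to the online procedure the row describes, though the exact policy construction is only given in the paper itself.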