Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Greedy Sampling Is Provably Efficient For RLHF
Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The theoretical results (i.e., the efficiency of directly using empirical estimates) are corroborated with simulation results. In particular, under both general and BT preference models, experimental results demonstrate that the simple greedy sampling approach achieves statistically comparable performance to prior methods with more sophisticated policy constructions. Experiments are conducted to corroborate the theoretical findings. |
| Researcher Affiliation | Academia | Di Wu, Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22903; Chengshuai Shi, Princeton Language and Intelligence, Princeton University, Princeton, NJ 08540; Jing Yang, Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22903; Cong Shen, Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22903 |
| Pseudocode | Yes | Algorithm 1 Online RLHF with Greedy Sampling ... Algorithm 2 Offline RLHF with Greedy Sampling |
| Open Source Code | Yes | The experiment codes have been uploaded in the supplementary materials, which will be made publicly accessible upon acceptance. |
| Open Datasets | No | For both the general preference model and the BT model, we consider the linear setting with randomly sampled context vectors and 6 fixed actions. In particular, the general preference model is considered to be a linear one with dimension k k k while the BT model with dimension k k, where k is set to 5. Detailed experimental setups and implementation details are deferred to Appendix E. For implementation, we choose k = 5 and first uniformly randomly sample from [0, 1] to construct the ground-truth preference model parameters M and W. We similarly sample 6 vectors from $[0, 1]^5$ as the action set A. In each iteration, we randomly sample a vector from the uniform distribution in $[0, 1]^5$ as the context vector, and then we sample action pairs $(a^1, a^2)$ based on the policies. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes generating data online by sampling context vectors and action pairs, and using a pre-collected offline dataset $D_0$, but does not detail any splits for $D_0$. Algorithm 2 Offline RLHF with Greedy Sampling: Input: parameter η, reference policy $\pi_0$, pre-collected data $D_0 = \{(x_i, a_i^1, a_i^2, y_i)\}_{i=1}^m$. |
| Hardware Specification | No | The reported experiments are light in computation, which are performed with a mainstream laptop and can be executed in minutes. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | For implementation, we choose k = 5 and first uniformly randomly sample from [0, 1] to construct the ground-truth preference model parameters M and W. We similarly sample 6 vectors from $[0, 1]^5$ as the action set A. In each iteration, we randomly sample a vector from the uniform distribution in $[0, 1]^5$ as the context vector, and then we sample action pairs $(a^1, a^2)$ based on the policies. We run the trajectory for T iterations and repeat the experiments 5 times, computing the averages and standard deviations. In both settings, the regularization coefficient η is set to 1. To simulate different degrees of optimism, we use 1 and 3 as the number of responses selected in the tournament. To consider different levels of optimism, β = 0.3 and 0.5 are used. |
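
The data-generation loop quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative reading of the quoted text, not the authors' code. Several pieces are assumptions: the bilinear reward `x @ W @ a`, the `K x K` shape of `W`, the uniform placeholder policy used to draw action pairs, the iteration count `num_iters`, and the reported statistic (fraction of rounds in which the first action is preferred). The paper's greedy-sampling policy and the regularization coefficient η are not modeled here.

```python
import numpy as np

K = 5            # context/action dimension (k = 5 in the quoted setup)
NUM_ACTIONS = 6  # size of the fixed action set
NUM_RUNS = 5     # number of independent repetitions


def run_once(seed, num_iters=200):
    """Simulate one trajectory; return the fraction of rounds where a1 beats a2."""
    rng = np.random.default_rng(seed)
    # Ground-truth BT parameter, sampled uniformly from [0, 1].
    # Assumption: W is K x K and the reward is bilinear, r(x, a) = x^T W a.
    W = rng.uniform(0.0, 1.0, size=(K, K))
    # Fixed action set: 6 vectors sampled from [0, 1]^5.
    actions = rng.uniform(0.0, 1.0, size=(NUM_ACTIONS, K))

    first_wins = 0
    for _ in range(num_iters):
        # Fresh context vector, uniform in [0, 1]^5.
        x = rng.uniform(0.0, 1.0, size=K)
        # Placeholder policy (assumption): draw the action pair uniformly at random.
        i, j = rng.choice(NUM_ACTIONS, size=2)
        rewards = (x @ W) @ actions.T  # one reward per action, shape (NUM_ACTIONS,)
        # Bradley-Terry preference: P(a_i preferred over a_j) = sigmoid(r_i - r_j).
        p_first = 1.0 / (1.0 + np.exp(-(rewards[i] - rewards[j])))
        first_wins += int(rng.random() < p_first)
    return first_wins / num_iters


# Repeat 5 times and report the average and standard deviation, as in the setup.
fractions = [run_once(seed) for seed in range(NUM_RUNS)]
print(f"mean={np.mean(fractions):.3f}, std={np.std(fractions):.3f}")
```

Seeding each run separately keeps the five repetitions independent and reproducible; swapping the placeholder policy for a learned one would only change how `i` and `j` are drawn.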