Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Authors: Heyang Zhao, Chenlu Ye, Quanquan Gu, Tong Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments with synthetic data to investigate the benefit of mixed-policy sampling and the effect of KL-regularization coefficient on the sample complexity of the problem. We plot the experimental results for RL from preference feedback in Figure 2 and the results for KL-regularized contextual bandits in Figure 1. All the trials are repeated for 10 times and plotted with the standard variation.
Researcher Affiliation | Academia | Heyang Zhao, University of California, Los Angeles, Los Angeles, CA 90095, EMAIL; Chenlu Ye, University of Illinois Urbana-Champaign, Urbana, IL 61801, EMAIL; Quanquan Gu, University of California, Los Angeles, CA 90095, USA, EMAIL; Tong Zhang, University of Illinois Urbana-Champaign, Urbana, IL 61801, EMAIL
Pseudocode | Yes | Algorithm 1: Two-stage Mixed-Policy Sampling (TMPS); Algorithm 2: Active Querying for KL-Regularized Bandits; Algorithm 3: Two-stage Mixed-Policy Sampling from Preference Feedback (TMPS-PF)
Open Source Code | No | Justification: This paper is purely theoretical with only synthetic experiments. (from question 5, Open access to data and code)
Open Datasets | No | In this section, we conduct experiments with synthetic data to investigate the benefit of mixed-policy sampling and the effect of KL-regularization coefficient on the sample complexity of the problem. ... We consider the case where context distribution d_0 is a projected Gaussian distribution over the unit sphere and A is a discrete set with |A| = 5. We construct the reward functions as R(ϕ, x, a) = ⟨x, ϕ(a)⟩, parameterized by a mapping ϕ from A to ℝ^10, and set the reference policy π_0 to be the uniform random policy. To generate ϕ, we sample ϕ(a) independently for each a ∈ A according to another projected Gaussian distribution over the sphere with radius equal to 5.
Dataset Splits | No | The paper uses synthetic data which is generated on the fly for experiments rather than a pre-existing dataset with defined splits. It mentions "All the trials are repeated for 10 times" which refers to experimental runs, not dataset splits.
Hardware Specification | No | Justification: This paper is purely theoretical. (from question 8, Experiments compute resources)
Software Dependencies | No | The paper does not mention any specific software or library names with version numbers.
Experiment Setup | Yes | In this section, we conduct experiments with synthetic data to investigate the benefit of mixed-policy sampling and the effect of KL-regularization coefficient on the sample complexity of the problem. We plot the experimental results for RL from preference feedback in Figure 2 and the results for KL-regularized contextual bandits in Figure 1. All the trials are repeated for 10 times and plotted with the standard variation. ... We consider the case where context distribution d_0 is a projected Gaussian distribution over the unit sphere and A is a discrete set with |A| = 5. We construct the reward functions as R(ϕ, x, a) = ⟨x, ϕ(a)⟩, parameterized by a mapping ϕ from A to ℝ^10, and set the reference policy π_0 to be the uniform random policy. To generate ϕ, we sample ϕ(a) independently for each a ∈ A according to another projected Gaussian distribution over the sphere with radius equal to 5. In Figure 2(b), it is shown that the sample complexity is remarkably affected by the KL-regularization term, corroborating our sharp analysis for regularized RLHF.
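Since no code is released, the synthetic environment quoted under Experiment Setup can be approximated with a short sketch. This is not the authors' implementation. It assumes that "projected Gaussian" means drawing a standard Gaussian vector and rescaling it to the target norm, and that contexts live in the same 10-dimensional space as the features. The function names, constants, and the seed are hypothetical choices made for illustration.

```python
import math
import random

DIM = 10          # feature dimension: phi maps each action to R^10
NUM_ACTIONS = 5   # |A| = 5
PHI_RADIUS = 5.0  # radius of the sphere that phi(a) is projected onto


def projected_gaussian(rng, dim, radius=1.0):
    """Assumed reading of 'projected Gaussian': draw a standard Gaussian
    vector and rescale it so its Euclidean norm equals `radius`."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [radius * c / norm for c in v]


def sample_phi(rng):
    """Sample one feature vector per action, independently."""
    return [projected_gaussian(rng, DIM, PHI_RADIUS) for _ in range(NUM_ACTIONS)]


def reward(phi, x, a):
    """R(phi, x, a) = <x, phi(a)>, the inner product of context and feature."""
    return sum(xi * fi for xi, fi in zip(x, phi[a]))


def reference_policy(rng):
    """Uniform random reference policy pi_0 over the action set."""
    return rng.randrange(NUM_ACTIONS)


rng = random.Random(0)            # arbitrary seed, not from the paper
phi = sample_phi(rng)             # reward parameters
x = projected_gaussian(rng, DIM)  # one context drawn from d_0 (unit sphere)
a = reference_policy(rng)         # one action drawn from pi_0
r = reward(phi, x, a)
```

Under these assumptions every context has norm 1 and every feature vector has norm 5, so by the Cauchy-Schwarz inequality each reward lies in [-5, 5].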