Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Precise Asymptotics and Refined Regret of Variance-Aware UCB
Authors: Yingying Fan, Yuxuan Han, Jinchi Lv, Xiaocong Xu, Zhengyuan Zhou
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figures 1 and 2b, we compare the empirical arm-pulling rates between UCB-V and the canonical UCB in a two-armed setting. Compared to the canonical UCB, its variance-aware version exhibits significantly greater fluctuations in arm-pulling numbers as the reward gap changes, and its arm-pulling distribution is more heavy-tailed. This highlights the significant differences in the variance-aware setting and introduces additional challenges. Simulations presented in Figure 2 confirm our theoretical predictions, demonstrating that UCB-V exhibits improved performance in high-σ1 scenarios, which was previously unknown. To illustrate the implication of our results on the reward inference, we conduct a simulation study on the Z-statistic for UCB and UCB-V using the setting from Example 2, as shown in Figure 4. |
| Researcher Affiliation | Academia | Yingying Fan, University of Southern California; Yuxuan Han, New York University; Jinchi Lv, University of Southern California; Xiaocong Xu, University of Southern California; Zhengyuan Zhou, New York University |
| Pseudocode | Yes | Algorithm 1: UCB-V Algorithm. 1: Input: Arm number K, time horizon T, and exploration coefficient ρ. 2: Pull each of the K arms once in the first K iterations. 3: for t = K + 1 to T do 4: Compute arm pulls for arm a up to time t by $n_{a,t} = \sum_{s \in [t-1]} \mathbf{1}\{A_s = a\}$, $a \in [K]$. 5: Compute the empirical means and variances $\bar{X}_{a,t} = \frac{1}{n_{a,t}} \sum_{s \in [t-1]} \mathbf{1}\{A_s = a\} X_s$, $\hat{\sigma}^2_{a,t} = \frac{1}{n_{a,t}} \sum_{s \in [t-1]} \mathbf{1}\{A_s = a\} (X_s - \bar{X}_{a,t})^2$. 6: Compute the optimistic rewards UCB(a, t) Xa,t + bσa,t ρ log T 1 na,t ρ log T na,t for $a \in [K]$. 7: Choose arm $A_t$ given by $A_t = \arg\max_{a \in [K]} \mathrm{UCB}(a, t)$. 8: end for |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology is provided, nor does it include a link to a code repository. The NeurIPS Paper Checklist for this paper states: "The paper includes only simple simulations, and their settings are already described in detail." for the open-source code question. |
| Open Datasets | No | In all experiments, we use a Beta(α, β) distribution to generate rewards. Given a desired mean $\mu$ and variance $\sigma^2$, and subject to the boundedness constraint on [0, 1], we set $\alpha = \mu \left( \frac{\mu(1-\mu)}{\sigma^2} - 1 \right)$, $\beta = (1-\mu) \left( \frac{\mu(1-\mu)}{\sigma^2} - 1 \right)$. This indicates that the data is generated for simulation rather than using an existing public dataset. |
| Dataset Splits | No | The paper uses generated data for simulations and specifies the number of repetitions (e.g., "5000 repetitions", "30 repetitions"). It does not mention traditional dataset splits like training, validation, or test sets, as is common for machine learning experiments on fixed datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. The NeurIPS Paper Checklist for this paper states: "The simulations in the paper are simple and can be executed on a standard laptop." for the hardware specification question. |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers needed to replicate the experiment. |
| Experiment Setup | Yes | In all experiments, we use a Beta(α, β) distribution to generate rewards. Given a desired mean $\mu$ and variance $\sigma^2$, and subject to the boundedness constraint on [0, 1], we set $\alpha = \mu \left( \frac{\mu(1-\mu)}{\sigma^2} - 1 \right)$, $\beta = (1-\mu) \left( \frac{\mu(1-\mu)}{\sigma^2} - 1 \right)$. Figure 1: Distributions of $n_{1,T}$. In both experiments that plot histograms of arm-pull counts, we set the time horizon to T = 50,000 and the number of repetitions to R = 5,000. The exploration hyperparameter is ρ = 2 for both UCB-V and UCB. The means and variances are set to $\mu_1 = \mu_2 = 1/2$ and $\sigma_1 = \sigma_2 = 1/4$ in Figure 1(a), and to $\sigma_1 = 0$, $\sigma_2 = 1/4$, and $\Delta = \sigma_2 \sqrt{(\log T)/T}$ in Figure 1(b). Figure 2: Regret and phase transition of optimal-arm pulls. For panel (a), we set the number of repetitions to 10 and the exploration hyperparameter to ρ = 2. We vary $\sigma_1 \in \{T^{-1/2}, T^{-1/4}, 1\}$ while keeping $\sigma_2 = T^{-1/4}$, $\mu_1 = 1/2$, $\Delta_2 = T^{-1/2}$, and $\mu_2 = \mu_1 + \Delta_2$ fixed across curves. For panel (b), we set T = 1,000,000, the number of repetitions to R = 30, and ρ = 2, and fix $\sigma_1 = 0$ and $\sigma_2 = 1/4$. We sweep $\Lambda_T = \sigma_2 \sqrt{\rho \log T} / (\sqrt{T}\, \Delta_2)$ by varying $\Delta_2$; for each value we plot the median and the 30% quantile of $n_{1,T}$ for UCB and UCB-V. |
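
Since the paper releases no code, the simulation setting quoted in the table can be approximated with a short sketch. The Python below moment-matches a Beta distribution to a target mean and variance, as the Experiment Setup row describes, and runs one two-armed trajectory. The function names `beta_params` and `simulate` are hypothetical, and both exploration bonuses are assumptions: the variance-aware one is a generic Bernstein-style form in the spirit of UCB-V, not a transcription of Algorithm 1, and the canonical one is a common textbook choice.

```python
import math
import random


def beta_params(mean, var):
    """Moment-match Beta(alpha, beta) on [0, 1] to a target mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("need 0 < var < mean * (1 - mean)")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def simulate(means, sds, horizon, rho=2.0, variance_aware=True, seed=0):
    """Run one index-policy trajectory and return the per-arm pull counts."""
    rng = random.Random(seed)
    k = len(means)
    # A zero standard deviation is treated as a deterministic reward.
    params = [beta_params(m, s * s) if s > 0 else None
              for m, s in zip(means, sds)]
    counts = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    log_t = math.log(horizon)

    def index(a):
        n = counts[a]
        mean = sums[a] / n
        # Empirical variance with 1/n normalisation, clipped at zero.
        var = max(sq_sums[a] / n - mean * mean, 0.0)
        width = rho * log_t / n
        if variance_aware:
            # ASSUMED Bernstein-style bonus; not taken from the paper.
            return mean + math.sqrt(var * width) + width
        # ASSUMED canonical UCB bonus.
        return mean + math.sqrt(width)

    for t in range(horizon):
        # Pull each arm once, then follow the largest index.
        arm = t if t < k else max(range(k), key=index)
        p = params[arm]
        reward = means[arm] if p is None else rng.betavariate(p[0], p[1])
        counts[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward * reward
    return counts


if __name__ == "__main__":
    print(beta_params(0.5, 0.25 ** 2))  # (1.5, 1.5)
    print(simulate([0.5, 0.5], [0.25, 0.25], horizon=5000))
```

Repeating `simulate` over many seeds and collecting the first entry of the returned counts would give a histogram comparable in spirit to Figure 1, though the numbers depend on the assumed bonus forms.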