Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Variance-aware Regret Bounds for Stochastic Contextual Dueling Bandits
Authors: Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, Quanquan Gu
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a new Sup Lin UCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound e O d q PT t=1 ฯ2 t +d , where ฯt is the variance of the pairwise comparison in round t, d is the dimension of the context vectors, and T is the time horizon. Our regret bound naturally aligns with the intuitive expectation in scenarios where the comparison is deterministic, the algorithm only suffers from an e O(d) regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of California, Los Angeles 2Department of Computer Science, University of Virginia |
| Pseudocode | Yes | Algorithm 1 Variance-Aware Contextual Dueling Bandit (VACDB) |
| Open Source Code | Yes | Our implementation is publicly available 1. [1] https://github.com/uclaml/VACDB |
| Open Datasets | Yes | To showcase the performance of our algorithms in a real-world setting, we use Event Time dataset (Zhang et al., 2016). |
| Dataset Splits | No | The paper describes generating synthetic data and using a real-world dataset, but it does not specify explicit train/validation/test splits with percentages or sample counts for either dataset. |
| Hardware Specification | No | The paper describes simulation experiments but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper states that its implementation is publicly available, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Experiment Setup. We study the proposed algorithm in simulation to compare it with those that are also designed for contextual dueling bandits. Each experiment instance is simulated for T = 4000 rounds. The unknown parameter ฮธ to be estimated is generated at random and normalized to be a unit vector. The feature dimension is set to d = 5. A total of |At| = 2d distinct contextual vectors are generated from { 1, 1}d. In each round, given the arm pair selected by the algorithm, a response is generated according to the random process defined in Section 3. For each experiment, a total of 128 repeated runs are carried out. We tune the confidence radius of each algorithm to showcase the best performance. The average cumulative regret is reported in Figure 1 along with the standard deviation in the shaded region. The link function ยต( ) is set to be the logistic function. |