reproducibilityindex.ai

Variance-aware Regret Bounds for Stochastic Contextual Dueling Bandits

Authors: Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, Quanquan Gu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We propose a new Sup Lin UCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound e O d q PT t=1 σ2 t +d , where σt is the variance of the pairwise comparison in round t, d is the dimension of the context vectors, and T is the time horizon. Our regret bound naturally aligns with the intuitive expectation in scenarios where the comparison is deterministic, the algorithm only suffers from an e O(d) regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
Researcher Affiliation	Academia	1Department of Computer Science, University of California, Los Angeles 2Department of Computer Science, University of Virginia
Pseudocode	Yes	Algorithm 1 Variance-Aware Contextual Dueling Bandit (VACDB)
Open Source Code	Yes	Our implementation is publicly available 1. [1] https://github.com/uclaml/VACDB
Open Datasets	Yes	To showcase the performance of our algorithms in a real-world setting, we use Event Time dataset (Zhang et al., 2016).
Dataset Splits	No	The paper describes generating synthetic data and using a real-world dataset, but it does not specify explicit train/validation/test splits with percentages or sample counts for either dataset.
Hardware Specification	No	The paper describes simulation experiments but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies	No	The paper states that its implementation is publicly available, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	Experiment Setup. We study the proposed algorithm in simulation to compare it with those that are also designed for contextual dueling bandits. Each experiment instance is simulated for T = 4000 rounds. The unknown parameter θ to be estimated is generated at random and normalized to be a unit vector. The feature dimension is set to d = 5. A total of \|At\| = 2d distinct contextual vectors are generated from { 1, 1}d. In each round, given the arm pair selected by the algorithm, a response is generated according to the random process defined in Section 3. For each experiment, a total of 128 repeated runs are carried out. We tune the confidence radius of each algorithm to showcase the best performance. The average cumulative regret is reported in Figure 1 along with the standard deviation in the shaded region. The link function µ( ) is set to be the logistic function.