Variance-aware Regret Bounds for Stochastic Contextual Dueling Bandits

Authors: Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, Quanquan Gu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We propose a new Sup Lin UCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound e O d q PT t=1 σ2 t +d , where σt is the variance of the pairwise comparison in round t, d is the dimension of the context vectors, and T is the time horizon. Our regret bound naturally aligns with the intuitive expectation in scenarios where the comparison is deterministic, the algorithm only suffers from an e O(d) regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
Researcher Affiliation Academia 1Department of Computer Science, University of California, Los Angeles 2Department of Computer Science, University of Virginia
Pseudocode Yes Algorithm 1 Variance-Aware Contextual Dueling Bandit (VACDB)
Open Source Code Yes Our implementation is publicly available 1. [1] https://github.com/uclaml/VACDB
Open Datasets Yes To showcase the performance of our algorithms in a real-world setting, we use Event Time dataset (Zhang et al., 2016).
Dataset Splits No The paper describes generating synthetic data and using a real-world dataset, but it does not specify explicit train/validation/test splits with percentages or sample counts for either dataset.
Hardware Specification No The paper describes simulation experiments but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies No The paper states that its implementation is publicly available, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes Experiment Setup. We study the proposed algorithm in simulation to compare it with those that are also designed for contextual dueling bandits. Each experiment instance is simulated for T = 4000 rounds. The unknown parameter θ to be estimated is generated at random and normalized to be a unit vector. The feature dimension is set to d = 5. A total of |At| = 2d distinct contextual vectors are generated from { 1, 1}d. In each round, given the arm pair selected by the algorithm, a response is generated according to the random process defined in Section 3. For each experiment, a total of 128 repeated runs are carried out. We tune the confidence radius of each algorithm to showcase the best performance. The average cumulative regret is reported in Figure 1 along with the standard deviation in the shaded region. The link function µ( ) is set to be the logistic function.