Variance-aware Regret Bounds for Stochastic Contextual Dueling Bandits
Authors: Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, Quanquan Gu
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a new Sup Lin UCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound e O d q PT t=1 σ2 t +d , where σt is the variance of the pairwise comparison in round t, d is the dimension of the context vectors, and T is the time horizon. Our regret bound naturally aligns with the intuitive expectation in scenarios where the comparison is deterministic, the algorithm only suffers from an e O(d) regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of California, Los Angeles 2Department of Computer Science, University of Virginia |
| Pseudocode | Yes | Algorithm 1 Variance-Aware Contextual Dueling Bandit (VACDB) |
| Open Source Code | Yes | Our implementation is publicly available 1. [1] https://github.com/uclaml/VACDB |
| Open Datasets | Yes | To showcase the performance of our algorithms in a real-world setting, we use Event Time dataset (Zhang et al., 2016). |
| Dataset Splits | No | The paper describes generating synthetic data and using a real-world dataset, but it does not specify explicit train/validation/test splits with percentages or sample counts for either dataset. |
| Hardware Specification | No | The paper describes simulation experiments but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper states that its implementation is publicly available, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Experiment Setup. We study the proposed algorithm in simulation to compare it with those that are also designed for contextual dueling bandits. Each experiment instance is simulated for T = 4000 rounds. The unknown parameter θ to be estimated is generated at random and normalized to be a unit vector. The feature dimension is set to d = 5. A total of |At| = 2d distinct contextual vectors are generated from { 1, 1}d. In each round, given the arm pair selected by the algorithm, a response is generated according to the random process defined in Section 3. For each experiment, a total of 128 repeated runs are carried out. We tune the confidence radius of each algorithm to showcase the best performance. The average cumulative regret is reported in Figure 1 along with the standard deviation in the shaded region. The link function µ( ) is set to be the logistic function. |