Practical Contextual Bandits with Feedback Graphs
Authors: Mengxiao Zhang, Yuheng Zhang, Olga Vrousgou, Haipeng Luo, Paul Mineiro
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we use empirical results to demonstrate the significant benefits of SquareCB.G in leveraging the graph feedback structure and its superior effectiveness compared to SquareCB. Following Foster and Krishnamurthy [2021], we use progressive validation (PV) loss as the evaluation metric, defined as $L_{\text{pv}}(T) = \frac{1}{T}\sum_{t=1}^{T} \ell_{t,a_t}$. All the feedback graphs used in the experiments are deterministic. We run experiments on an Intel Xeon Gold 6240R 2.4GHz CPU and the convex program solver is implemented via Vowpal Wabbit [Langford et al., 2007]. Figure 1 (caption): Left: Performance of SquareCB.G on the RCV1 dataset under three different feedback graphs. Right: Performance comparison between SquareCB.G and SquareCB under random directed self-aware feedback graphs. (A short sketch of the PV-loss computation appears after the table.) |
| Researcher Affiliation | Collaboration | Mengxiao Zhang University of Southern California mengxiao.zhang@usc.edu Yuheng Zhang University of Illinois Urbana-Champaign yuhengz2@illinois.edu Olga Vrousgou Microsoft Research Olga.Vrousgou@microsoft.com Haipeng Luo University of Southern California haipengl@usc.edu Paul Mineiro Microsoft Research pmineiro@microsoft.com |
| Pseudocode | Yes | Algorithm 1 SquareCB.G (Theorem 3.6 provides an efficient implementation of Eq. (4)). Input: parameter $\gamma \geq 4$, a regression oracle $\text{Alg}_{\text{Sq}}$. For $t = 1, 2, \ldots, T$: receive context $x_t$ and directed feedback graph $G_t$; obtain an estimator $\hat{f}_t$ from the oracle $\text{Alg}_{\text{Sq}}$; compute the distribution $p_t \in \Delta([K])$ such that $p_t = \operatorname{argmin}_{p \in \Delta([K])} \text{dec}_\gamma(p; \hat{f}_t, x_t, G_t)$, where $\text{dec}_\gamma(p; \hat{f}_t, x_t, G_t) := \sup_{a^\star \in [K],\, f^\star \in \Phi} \mathbb{E}_{a \sim p}\big[ f^\star(x_t, a) - f^\star(x_t, a^\star) - \frac{\gamma}{4}\, \mathbb{E}_{A \sim G_t(\cdot \mid a)} \sum_{a' \in A} \big(\hat{f}_t(x_t, a') - f^\star(x_t, a')\big)^2 \big]$ and $\Phi := \{f : \mathcal{X} \times [K] \to \mathbb{R}\}$. Sample $a_t \sim p_t$ and observe $\{\ell_{t,j}\}_{j \in A_t}$ where $A_t \sim G_t(\cdot \mid a_t)$. Feed the tuples $\{(x_t, j, \ell_{t,j})\}_{j \in A_t}$ to the oracle $\text{Alg}_{\text{Sq}}$. (A runnable sketch of this loop follows the table.) |
| Open Source Code | No | The paper states: 'Our code is built on PyTorch framework [Paszke et al., 2019]' and 'the convex program solver is implemented via Vowpal Wabbit [Langford et al., 2007]'. Appendix A.3 provides a 'Python Solution to Eq. (5)' as a code snippet. However, there is no explicit statement or link indicating that the authors release the complete source code for their proposed method (SquareCB.G). |
| Open Datasets | Yes | We conduct experiments on RCV1 [Lewis et al., 2004], which is a multilabel text-categorization dataset. |
| Dataset Splits | No | The paper mentions using 'progressive validation (PV) loss' as an evaluation metric and uses 'RCV1 dataset' and a 'Synthetic Inventory Dataset'. While these indicate data usage, the paper does not explicitly provide information on specific train, validation, or test dataset splits (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | Yes | We run experiments on an Intel Xeon Gold 6240R 2.4GHz CPU. |
| Software Dependencies | No | The paper states: 'Our code is built on PyTorch framework [Paszke et al., 2019]' and 'the convex program solver is implemented via Vowpal Wabbit [Langford et al., 2007]'. While these are software dependencies, specific version numbers for PyTorch and Vowpal Wabbit are not provided (the citations give only publication years), which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | The oracle is implemented by applying online gradient descent with learning rate $\eta$ searched over {0.1, 0.2, 0.5, 1, 2, 4}. As suggested by Foster and Krishnamurthy [2021], we use a time-varying exploration parameter $\gamma_t = c\sqrt{\alpha t}$, where $t$ is the iteration index, $c$ is searched over {8, 16, 32, 64, 128}, and $\alpha$ is the independence number of the corresponding feedback graph. (A sketch of these grids and the schedule follows the table.) |
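
Below is a minimal sketch of the progressive validation (PV) loss used as the evaluation metric in the Research Type row: the average of the losses incurred by the actions the learner actually played, computed online. The function names and the toy loss sequence are illustrative, not taken from the paper's code.

```python
import numpy as np

def running_pv_loss(losses):
    """Running progressive validation loss L_pv(T) = (1/T) * sum_t loss_t,
    where loss_t is the loss of the action actually played at round t
    (each loss is observed before the model updates on that round)."""
    losses = np.asarray(losses, dtype=float)
    return np.cumsum(losses) / np.arange(1, len(losses) + 1)

# Example: PV-loss curve for a toy 0/1 loss sequence.
print(running_pv_loss([1.0, 0.0, 1.0, 0.0]))  # [1.0, 0.5, 0.667, 0.5]
```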
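The following Python sketch mirrors the outer loop of Algorithm 1 (SquareCB.G) from the Pseudocode row. The `oracle` and `env` interfaces are assumptions made for illustration, and `solve_dec` here is only an inverse-gap-weighting-style stand-in: the paper solves the actual convex program of Eq. (4) exactly (Theorem 3.6), with the solver implemented in Vowpal Wabbit.

```python
import numpy as np

def solve_dec(f_hat, gamma):
    """Stand-in for the convex program of Eq. (4). This inverse-gap-weighting
    fallback ignores the feedback graph and only illustrates the shape of the
    output distribution; it is NOT the paper's graph-aware solver."""
    gaps = f_hat - f_hat.min()
    p = 1.0 / (len(f_hat) + gamma * gaps)
    return p / p.sum()

def squarecb_g(oracle, env, T, gamma_fn, rng):
    """Outer loop of Algorithm 1 (SquareCB.G). `oracle` (predict/update) and
    `env` (observe/loss) are assumed interfaces, not the paper's classes."""
    for t in range(1, T + 1):
        x_t, G_t = env.observe(t)          # context and deterministic K x K 0/1 graph
        f_hat = oracle.predict(x_t)        # loss estimates for all K actions, shape (K,)
        p_t = solve_dec(f_hat, gamma_fn(t))
        a_t = rng.choice(len(p_t), p=p_t)  # sample a_t from p_t
        A_t = np.flatnonzero(G_t[a_t])     # observed set: out-neighbors of a_t in G_t
        for j in A_t:                      # feed (x_t, j, loss_{t,j}) tuples to the oracle
            oracle.update(x_t, j, env.loss(t, j))
```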
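Finally, a small sketch of the hyperparameter grids and exploration schedule from the Experiment Setup row, assuming the independence number $\alpha$ of the (deterministic) feedback graph is known, as it is in the experiments. Helper names are illustrative.

```python
import itertools
import math

# Grids quoted in the Experiment Setup row.
ETA_GRID = [0.1, 0.2, 0.5, 1, 2, 4]   # online gradient descent learning rates
C_GRID = [8, 16, 32, 64, 128]         # scale of the exploration schedule

def make_gamma_schedule(c, alpha):
    """Time-varying exploration parameter gamma_t = c * sqrt(alpha * t),
    as suggested by Foster and Krishnamurthy [2021]."""
    return lambda t: c * math.sqrt(alpha * t)

# The hyperparameter search sweeps the full (eta, c) grid.
search_space = list(itertools.product(ETA_GRID, C_GRID))
```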