Stochastic Bandits with Graph Feedback in Non-Stationary Environments
Authors: Shiyin Lu, Yao Hu, Lijun Zhang (pp. 8758-8766)
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experimental results to illustrate the empirical performance of our proposed algorithms. For UCB-NEW and SEASIDE, which require prior knowledge of L to achieve the regret bounds scaling with L, we examine two versions, i.e., one with and the other without tuning parameters in terms of L. For ASG, we reduce its computational cost according to Remark 2. We adopt UCB-NE (Hu, Mehta, and Pan 2019), Exp3-SET (Alon et al. 2017), EAS (Elimination with Alpha Sample, Cohen, Hazan, and Koren 2016), Greedy-LP and UCB-LP (Buccapatnam, Eryilmaz, and Shroff 2014) as baseline algorithms. We use a synthetic dataset constructed as follows. Let K = 30, T = 1000000 and pick L from {10, 20}. We first randomly choose L-1 breakpoints from {2, ..., T-1} to partition [1, T] into L stationary intervals. Then, in each stationary interval, we choose 2 arms uniformly at random from [K] as optimal arms and set their mean rewards to be 0.9. For suboptimal arms, their mean rewards are sampled from a uniform distribution with support [0, 0.7]. For each arm, we generate its rewards by drawing samples from truncated normal distributions with support [0, 1] and variance 0.01. Finally, inspired by Brigham and Dutton (1983), we construct a feedback graph, illustrated in Appendix A, with α = 10 and θ = 14. We run each algorithm 10 times and report the average performance in Fig. 1, where "-w/o" stands for "without prior knowledge of L". As can be seen, our proposed algorithms significantly outperform the baseline algorithms, which is expected since the baseline algorithms assume the reward of each arm is either drawn from a stationary distribution or determined by an adversary. Furthermore, without prior knowledge of L, UCB-NEW and SEASIDE still behave well and achieve much smaller regrets than the baseline algorithms, demonstrating their practicality. Finally, while SEASIDE attains the smallest regret with prior knowledge of L, it becomes inferior to ASG when L is unknown, which validates the advantage of ASG's adaptivity. (A minimal sketch of this data-generation procedure appears below the table.) |
| Researcher Affiliation | Collaboration | Shiyin Lu (1), Yao Hu (2), Lijun Zhang (1); (1) National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; (2) Youku Cognitive and Intelligent Lab, Alibaba Group, Beijing 100102, China; {lusy, zhanglj}@lamda.nju.edu.cn, yaoohu@alibaba-inc.com |
| Pseudocode | Yes | Algorithm 1 UCB-NEW; Algorithm 2 SEASIDE; Algorithm 3 ASG; Algorithm 4 Compute W |
| Open Source Code | No | The paper does not provide any statements or links indicating that the source code for its methodology is open-source or publicly available. |
| Open Datasets | No | The paper uses a 'synthetic dataset' and describes its construction in detail, but it does not provide concrete access information (e.g., a link, DOI, or specific citation to a public repository) for the generated dataset itself. The citation for Brigham and Dutton (1983) refers to the graph construction, not the dataset. |
| Dataset Splits | No | The paper describes the generation of a synthetic dataset over a time horizon T but does not specify any explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies or their version numbers that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We use a synthetic dataset constructed as follows. Let K = 30, T = 1000000 and pick L from {10, 20}. We run each algorithm 10 times and report the average performance in Fig. 1. ... when the number of reward distribution changes L is known in advance, by setting ρ optimally as ρ = √(T/L), UCB-NEW achieves an O(√(θLT)) dynamic regret bound. ... by setting p optimally as p = √(L/(16αT)), SEASIDE achieves an O(√(αLT)) dynamic regret bound. (A small sketch of these parameter settings appears below the table.) |
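
The synthetic dataset is described precisely enough to reconstruct. Below is a minimal sketch of that construction, assuming NumPy and `scipy.stats.truncnorm`; it is not the authors' code, and all variable names are our own. It covers the random breakpoints, the per-interval arm means, and the truncated-normal reward draws.

```python
# Sketch of the synthetic dataset described above (not the authors' code).
import numpy as np
from scipy.stats import truncnorm

K, T, L = 30, 1_000_000, 10          # the paper also uses L = 20
rng = np.random.default_rng(0)

# Randomly choose L-1 breakpoints from {2, ..., T-1}, splitting [1, T] into L stationary intervals.
breakpoints = np.sort(rng.choice(np.arange(2, T), size=L - 1, replace=False))
bounds = np.concatenate(([1], breakpoints, [T + 1]))   # interval i covers [bounds[i], bounds[i+1])

means = np.empty((L, K))
for i in range(L):
    # Per interval: 2 optimal arms with mean 0.9, suboptimal means uniform on [0, 0.7].
    means[i] = rng.uniform(0.0, 0.7, size=K)
    optimal = rng.choice(K, size=2, replace=False)
    means[i, optimal] = 0.9

def draw_reward(t, arm):
    """Reward of `arm` at round t: truncated normal on [0, 1] with variance 0.01
    centered at the interval-specific mean."""
    i = np.searchsorted(bounds, t, side="right") - 1
    mu, sigma = means[i, arm], np.sqrt(0.01)
    a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)
```

The feedback-graph construction (α = 10, θ = 14) following Brigham and Dutton (1983) is only shown in the paper's Appendix A and is not sketched here.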
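
For completeness, a tiny sketch of the optimal parameter settings quoted in the Experiment Setup cell, under our reading of the reconstructed formulas (ρ for UCB-NEW, p for SEASIDE):

```python
import math

T, L, alpha = 1_000_000, 10, 10          # values from the experiment description
rho = math.sqrt(T / L)                   # rho = sqrt(T/L) for UCB-NEW
p = math.sqrt(L / (16 * alpha * T))      # p = sqrt(L/(16*alpha*T)) for SEASIDE
print(rho, p)                            # about 316.23 and 0.00025
```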