Stochastic Bandits with Graph Feedback in Non-Stationary Environments

Authors: Shiyin Lu, Yao Hu, Lijun Zhang

AAAI 2021, pp. 8758-8766 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present experimental results to illustrate the empirical performance of our proposed algorithms. For UCB-NEW and SEASIDE, which require prior knowledge of L to achieve the regret bounds scaling with L, we examine two versions, i.e., one with and the other without tuning parameters in terms of L. For ASG, we reduce its computational cost according to Remark 2. We adopt UCB-NE (Hu, Mehta, and Pan 2019), Exp3-SET (Alon et al. 2017), EAS (Elimination with Alpha Sample; Cohen, Hazan, and Koren 2016), Greedy-LP, and UCB-LP (Buccapatnam, Eryilmaz, and Shroff 2014) as baseline algorithms. We use a synthetic dataset constructed as follows. Let K = 30, T = 1000000, and pick L from {10, 20}. We first randomly choose L-1 breakpoints from {2, ..., T-1} to partition [1, T] into L stationary intervals. Then, in each stationary interval, we choose 2 arms uniformly at random from [K] as optimal arms and set their mean rewards to 0.9. For suboptimal arms, their mean rewards are sampled from a uniform distribution with support [0, 0.7]. For each arm, we generate its rewards by drawing samples from truncated normal distributions with support [0, 1] and variance 0.01. Finally, inspired by Brigham and Dutton (1983), we construct a feedback graph, illustrated in Appendix A, with α = 10 and θ = 14. We run each algorithm 10 times and report the average performance in Fig. 1, where '-w/o' stands for 'without prior knowledge of L'. As can be seen, our proposed algorithms significantly outperform the baseline algorithms, which is expected since the baseline algorithms assume the reward of each arm is either drawn from a stationary distribution or determined by an adversary. Furthermore, without prior knowledge of L, UCB-NEW and SEASIDE still behave well and achieve much smaller regret than the baseline algorithms, demonstrating their practicality. Finally, while SEASIDE attains the smallest regret with prior knowledge of L, it becomes inferior to ASG when L is unknown, which validates the advantage of ASG's adaptivity. (See the data-generation sketch after the table.)
Researcher Affiliation | Collaboration | Shiyin Lu (1), Yao Hu (2), Lijun Zhang (1); (1) National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; (2) Youku Cognitive and Intelligent Lab, Alibaba Group, Beijing 100102, China; {lusy, zhanglj}@lamda.nju.edu.cn, yaoohu@alibaba-inc.com
Pseudocode | Yes | Algorithm 1: UCB-NEW; Algorithm 2: SEASIDE; Algorithm 3: ASG; Algorithm 4: Compute W
Open Source Code | No | The paper does not provide any statements or links indicating that the source code for its methodology is open-source or publicly available.
Open Datasets | No | The paper uses a 'synthetic dataset' and describes its construction in detail, but it does not provide concrete access information (e.g., a link, DOI, or specific citation to a public repository) for the generated dataset itself. The citation for Brigham and Dutton (1983) refers to the graph construction, not the dataset.
Dataset Splits | No | The paper describes the generation of a synthetic dataset over a time horizon T but does not specify any explicit training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not specify any software dependencies or their version numbers that would be needed to replicate the experiments.
Experiment Setup | Yes | We use a synthetic dataset constructed as follows. Let K = 30, T = 1000000, and pick L from {10, 20}. We run each algorithm 10 times and report the average performance in Fig. 1. ... when the number of reward distribution changes L is known in advance, by setting ρ optimally as ρ = √(T/L), UCB-NEW achieves an O(√(θLT)) dynamic regret bound. ... by setting p optimally as p = √(L/(16αT)), SEASIDE achieves an O(√(αLT)) dynamic regret bound. (See the parameter sketch after the table.)
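
Below is a minimal sketch, in Python with NumPy/SciPy, of the synthetic non-stationary data-generation procedure quoted in the Research Type row: K = 30 arms, T = 1000000 rounds, L stationary intervals separated by random breakpoints, two optimal arms with mean reward 0.9 per interval, suboptimal means drawn from Uniform[0, 0.7], and per-round rewards drawn from truncated normal distributions on [0, 1] with variance 0.01. The feedback-graph construction (Brigham and Dutton 1983, α = 10, θ = 14) is not reproduced here, and all function and variable names are illustrative, not the authors' code.

```python
# Sketch of the paper's synthetic piecewise-stationary reward generator (assumed details noted above).
import numpy as np
from scipy.stats import truncnorm


def make_piecewise_stationary_means(K=30, T=1_000_000, L=10, rng=None):
    """Return sorted breakpoints and an (L, K) matrix of per-interval mean rewards."""
    rng = np.random.default_rng() if rng is None else rng
    # L-1 breakpoints from {2, ..., T-1} partition [1, T] into L stationary intervals.
    breakpoints = np.sort(rng.choice(np.arange(2, T), size=L - 1, replace=False))
    means = np.empty((L, K))
    for i in range(L):
        # Suboptimal arms: mean rewards drawn uniformly from [0, 0.7].
        means[i] = rng.uniform(0.0, 0.7, size=K)
        # Two optimal arms per interval, each with mean reward 0.9.
        optimal = rng.choice(K, size=2, replace=False)
        means[i, optimal] = 0.9
    return breakpoints, means


def draw_reward(mean, rng, std=0.1):
    """One reward from a normal with the given mean and variance 0.01 (std = 0.1),
    truncated to the support [0, 1]."""
    a, b = (0.0 - mean) / std, (1.0 - mean) / std
    return truncnorm.rvs(a, b, loc=mean, scale=std, random_state=rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    breakpoints, means = make_piecewise_stationary_means(L=10, rng=rng)
    # Reward of arm 0 at round t: look up which stationary interval t falls into.
    t = 1
    interval = np.searchsorted(breakpoints, t, side="right")
    print(draw_reward(means[interval, 0], rng))
```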
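And a small helper, under the same assumptions, for the tuned parameters quoted in the Experiment Setup row when L is known in advance: ρ = √(T/L) for UCB-NEW and p = √(L/(16αT)) for SEASIDE. The function name and the interpretation of the returned values as generic tuning parameters are illustrative.

```python
# Sketch: parameter settings for the known-L versions of UCB-NEW and SEASIDE.
import math


def tuned_parameters(T=1_000_000, L=10, alpha=10):
    rho = math.sqrt(T / L)               # UCB-NEW tuning parameter (rho in the paper)
    p = math.sqrt(L / (16 * alpha * T))  # SEASIDE tuning parameter (p in the paper)
    return rho, p


print(tuned_parameters())  # roughly (316.23, 0.00025) for T = 1e6, L = 10, alpha = 10
```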