Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits

Authors: Siwei Wang, Longbo Huang, John C. S. Lui

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we conduct experiments based on real-world dataset, to compare the Restless-UCB policy with state-of-the-art benchmarks. Our results show that Restless-UCB outperforms existing algorithms in regret, and significantly reduces the running time."
Researcher Affiliation | Academia | (1) Department of Computer Science and Technology, Tsinghua University, wangsw2020@mail.tsinghua.edu.cn; (2) Institute for Interdisciplinary Information Sciences, Tsinghua University, longbohuang@mail.tsinghua.edu.cn; (3) Department of Computer Science and Engineering, The Chinese University of Hong Kong, cslui@cse.cuhk.edu.hk
Pseudocode | Yes | "Algorithm 1 Restless-UCB Policy"
Open Source Code | No | The paper does not include an explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | "Here we use the wireless communication dataset in [37]."
Dataset Splits | No | The paper mentions using "constructed instances" and a "wireless communication dataset in [37]", but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined split citations).
Hardware Specification | Yes | "The results in Table 1 take average over 50 runs of a single-threaded program (running on an Intel E5-2660 v3 workstation with 256GB RAM)."
Software Dependencies | No | The paper mentions "a single-threaded program" but does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch x.x, or specific solver versions).
Experiment Setup | Yes | "In all these experiments, we use the offline policy proposed by [28] as the offline oracle of restless bandit problems. We consider two problem instances, each instance contains two arms and each arm evolves according to a two-state Markov chain. In both instances, r(i, 2) = 0 for any arm i, and all Markov chains start at state 2. In problem instance one, r(1, 1) = 1, r(2, 1) = 0.8, P1(1, 1) = 0.7, P1(2, 2) = 0.8, P2(1, 1) = 0.5 and P2(2, 2) = 0.6. We use the transition probability matrices in Tables IV and VI in [37]. We also use the average direct signal mean given in Tables III and V in [37] as the expected reward." (An illustrative reconstruction of problem instance one is sketched below the table.)
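
For concreteness, here is a minimal, hypothetical sketch that reconstructs "problem instance one" from the numbers quoted in the Experiment Setup row and simulates it against an arbitrary policy. It is not code from the paper: only the transition probabilities, the rewards, and the start state (state 2 for every arm) come from the quoted text; the `simulate` helper, the state-indexing convention, and the round-robin baseline are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# States are indexed 0 and 1, standing for the paper's states 1 and 2.
# P[i] is arm i's transition matrix: P[i][s, s'] = Pr(next state s' | current state s).
P = {
    1: np.array([[0.7, 0.3],    # P1(1, 1) = 0.7
                 [0.2, 0.8]]),  # P1(2, 2) = 0.8
    2: np.array([[0.5, 0.5],    # P2(1, 1) = 0.5
                 [0.4, 0.6]]),  # P2(2, 2) = 0.6
}

# r[i][s] is the expected reward of pulling arm i in state s; r(i, 2) = 0 for both arms.
r = {
    1: np.array([1.0, 0.0]),    # r(1, 1) = 1
    2: np.array([0.8, 0.0]),    # r(2, 1) = 0.8
}


def simulate(policy, T=10_000):
    """Run a single-play restless bandit for T rounds and return the total reward.

    `policy(t)` returns the arm (1 or 2) to pull at round t. In a restless bandit
    every chain keeps evolving whether or not its arm is pulled.
    """
    state = {i: 1 for i in P}  # index 1 == the paper's state 2, the given start state
    total = 0.0
    for t in range(T):
        arm = policy(t)
        total += r[arm][state[arm]]
        for i in P:  # all chains transition, pulled or not (the "restless" property)
            state[i] = rng.choice(2, p=P[i][state[i]])
    return total


# Illustrative baseline only (round-robin), not the Restless-UCB policy itself.
print(simulate(lambda t: 1 + (t % 2)))
```

Problem instance two and the wireless communication experiment are not reconstructed here, since their parameters come from Tables III-VI of [37] and are not quoted in the report.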