Online Restless Bandits with Unobserved States

Authors: Bowen Jiang, Bo Jiang, Jian Li, Tao Lin, Xinbing Wang, Chenghu Zhou

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show through simulations that TSEETC outperforms existing algorithms in regret. We conduct the proof-of-concept experiments, and compare our policy with existing baseline algorithms. Our results show that TSEETC outperforms existing algorithms in regret and the regret order is consistent with our theoretical result.
Researcher Affiliation | Academia | (1) Shanghai Jiao Tong University, Shanghai, China; (2) SUNY Binghamton University, Binghamton, NY, USA; (3) Communication University of China, Beijing, China; (4) Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China.
Pseudocode | Yes | Algorithm 1: Posterior Update for R_i(s, ·) and P_i(s, ·). Algorithm 2: Thompson Sampling with Episodic Explore-Then-Commit (TSEETC).
Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the described methodology.
Open Datasets | No | The paper describes a simulated environment: 'We consider two arms and there are two hidden states (0 and 1) for each arm.' It does not use or provide access information for a pre-existing public dataset.
Dataset Splits | No | The paper describes a simulation setup ('The learning horizon T = 50000, and each algorithm runs 100 iterations.') but does not specify explicit training, validation, or test dataset splits, as would be typical for experiments on pre-existing datasets.
Hardware Specification | No | The paper mentions running 'simulations' and 'experiments' but does not specify any details about the hardware (e.g., CPU, GPU models, memory) used to conduct these.
Software Dependencies | No | The paper mentions various baseline algorithms (e.g., 'ϵ-greedy', 'Sliding-Window UCB', 'RUCB', 'Q-learning', 'SEEU') and cites papers related to them, but it does not specify the versions of any software libraries, programming languages, or environments used for implementation or experimentation.
Experiment Setup | Yes | The learning horizon T = 50000, and each algorithm runs 100 iterations. At state 1, the reward set is {10, 20} and the reward set is {−10, 10} at state 0. We initialize the algorithm with an uninformed Dirichlet prior on the unknown parameters. The baselines include ϵ-greedy (Lattimore & Szepesvári, 2020) with ϵ = 0.01 and Sliding-Window UCB (Garivier & Moulines, 2011) with a specified window size (equal to 50).
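
The pseudocode row above refers to Algorithm 1 (posterior update for R_i(s, ·) and P_i(s, ·)) and Algorithm 2 (TSEETC), but no code is released. A minimal sketch of the Dirichlet-count bookkeeping that such a posterior update relies on is given below; the class name, method signatures, and the simple hard-assignment update are assumptions for illustration, not the paper's exact procedure, which must additionally handle the fact that the states are unobserved.

```python
import numpy as np


class DirichletArmModel:
    """Hypothetical sketch: Dirichlet pseudo-counts for one arm's unknown
    2-state transition matrix P_i(s, .) and reward (emission) matrix R_i(s, .).
    This is a standard Dirichlet-multinomial update, not the paper's exact
    belief-based procedure for unobserved states."""

    def __init__(self, n_states=2, n_rewards=2, prior=1.0):
        # Uninformed (symmetric) Dirichlet prior on the unknown parameters.
        self.trans_counts = np.full((n_states, n_states), prior)
        self.emit_counts = np.full((n_states, n_rewards), prior)

    def update(self, s_prev, s_next, reward_idx):
        # Conjugate update: add one pseudo-count per attributed transition
        # (s_prev -> s_next) and per observed reward emission at s_next.
        self.trans_counts[s_prev, s_next] += 1
        self.emit_counts[s_next, reward_idx] += 1

    def sample_parameters(self, rng):
        # Thompson-sampling step: draw P_i and R_i row-wise from the posterior.
        P = np.stack([rng.dirichlet(row) for row in self.trans_counts])
        R = np.stack([rng.dirichlet(row) for row in self.emit_counts])
        return P, R
```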
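
Similarly, the experiment setup row can only be approximated. The sketch below wires the quoted details (two arms with two hidden states each, reward sets {10, 20} at state 1 and {−10, 10} at state 0, T = 50000, ϵ-greedy with ϵ = 0.01) into a runnable toy simulation; the transition matrices, the uniform reward draw within each set, and the empirical-mean ϵ-greedy rule are illustrative assumptions, since the quoted text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed placeholders: the quoted setup does not list the hidden-state
# transition matrices, so these 2x2 matrices are illustrative only.
TRANSITIONS = [np.array([[0.7, 0.3], [0.4, 0.6]]),
               np.array([[0.5, 0.5], [0.2, 0.8]])]
REWARD_SETS = {0: [-10, 10], 1: [10, 20]}  # per the setup quoted above
T, EPS = 50_000, 0.01                      # horizon and epsilon-greedy parameter

states = [0, 0]       # hidden state of each of the two arms
means = np.zeros(2)   # empirical mean reward per arm
pulls = np.zeros(2)   # number of pulls per arm

for t in range(T):
    # Restless dynamics: every arm evolves whether or not it is pulled.
    states = [int(rng.choice(2, p=TRANSITIONS[i][s])) for i, s in enumerate(states)]

    # epsilon-greedy baseline: explore with probability EPS, else exploit.
    if rng.random() < EPS or pulls.min() == 0:
        arm = int(rng.integers(2))
    else:
        arm = int(np.argmax(means))

    reward = rng.choice(REWARD_SETS[states[arm]])      # noisy view of the hidden state
    pulls[arm] += 1
    means[arm] += (reward - means[arm]) / pulls[arm]   # incremental mean update
```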