Online Restless Bandits with Unobserved States
Authors: Bowen Jiang, Bo Jiang, Jian Li, Tao Lin, Xinbing Wang, Chenghu Zhou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show through simulations that TSEETC outperforms existing algorithms in regret. We conduct proof-of-concept experiments and compare our policy with existing baseline algorithms; the results show that TSEETC outperforms the baselines and that the regret order is consistent with our theoretical result. |
| Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University, Shanghai, China. 2 SUNY Binghamton University, Binghamton, NY, USA. 3 Communication University of China, Beijing, China. 4 Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China. |
| Pseudocode | Yes | Algorithm 1: Posterior Update for R_i(s, ·) and P_i(s, ·). Algorithm 2: Thompson Sampling with Episodic Explore-Then-Commit (TSEETC). A hedged code sketch of the posterior update appears after this table. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the described methodology. |
| Open Datasets | No | The paper describes a simulated environment: 'We consider two arms and there are two hidden states (0 and 1) for each arm.' It does not use or provide access information for a pre-existing public dataset. |
| Dataset Splits | No | The paper describes a simulation setup ('The learning horizon T = 50000, and each algorithm runs 100 iterations.') but does not specify explicit training, validation, or test dataset splits, as would be typical for experiments on pre-existing datasets. |
| Hardware Specification | No | The paper mentions running 'simulations' and 'experiments' but does not specify any details about the hardware (e.g., CPU, GPU models, memory) used to conduct these. |
| Software Dependencies | No | The paper mentions various baseline algorithms (e.g., 'ϵ-greedy', 'Sliding-Window UCB', 'RUCB', 'Q-learning', 'SEEU') and cites papers related to them, but it does not specify the versions of any software libraries, programming languages, or environments used for implementation or experimentation. |
| Experiment Setup | Yes | The learning horizon T = 50000, and each algorithm runs 100 iterations. At state 1, the reward set is {10, 20}, and at state 0 the reward set is {−10, 10}. We initialize the algorithm with an uninformed Dirichlet prior on the unknown parameters. The baselines include ϵ-greedy (Lattimore & Szepesvári, 2020) with ϵ = 0.01 and Sliding-Window UCB (Garivier & Moulines, 2011) with a specified window size (equal to 50). An illustrative simulation sketch also follows this table. |
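
The paper's Algorithm 1 maintains Dirichlet posteriors over each arm's unknown reward and transition parameters, starting from uninformed priors. The sketch below is a minimal, hypothetical simplification in Python that assumes the arm's state is observed when counts are updated; the actual algorithm handles unobserved states by maintaining a mixture of Dirichlet distributions over belief states. Class and method names are illustrative, not taken from the paper.

```python
import numpy as np

class DirichletModel:
    """Hypothetical sketch of a Dirichlet posterior over one arm's
    unknown reward and transition parameters (simplified version of
    the posterior-update step; the paper's Algorithm 1 works with
    mixtures of Dirichlets because the state is unobserved)."""

    def __init__(self, n_states, reward_values):
        self.reward_values = reward_values            # e.g. [-10, 10] or [10, 20]
        # Uninformed Dirichlet priors: all pseudo-counts start at 1.
        self.trans_counts = np.ones((n_states, n_states))
        self.reward_counts = np.ones((n_states, len(reward_values)))

    def update(self, s, s_next, reward):
        """Increment pseudo-counts after observing a transition and a reward."""
        self.trans_counts[s, s_next] += 1
        r_idx = self.reward_values.index(reward)
        self.reward_counts[s, r_idx] += 1

    def sample(self, rng):
        """Thompson-style draw of a transition matrix and reward probabilities."""
        P = np.array([rng.dirichlet(row) for row in self.trans_counts])
        R = np.array([rng.dirichlet(row) for row in self.reward_counts])
        return P, R
```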
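
The experiment setup can be made concrete with a small simulation sketch. Only the two-arm/two-state structure, the reward sets, the horizon T = 50000, and the ϵ = 0.01 greedy baseline come from the paper's description; the transition matrices, reward probabilities, and the reduced number of runs used here are placeholder assumptions for illustration.

```python
import numpy as np

# Hypothetical simulation of the described setup: two restless arms, each with
# two hidden states; reward set {10, 20} in state 1 and {-10, 10} in state 0.
# Transition matrices and reward probabilities below are illustrative
# placeholders, not values taken from the paper.
N_ARMS, N_STATES = 2, 2
T = 50_000
REWARDS = {0: [-10, 10], 1: [10, 20]}
TRANS = np.array([[[0.7, 0.3], [0.4, 0.6]],    # arm 0 (placeholder)
                  [[0.6, 0.4], [0.2, 0.8]]])   # arm 1 (placeholder)
REWARD_PROBS = np.full((N_STATES, 2), 0.5)     # placeholder reward distribution

def run_epsilon_greedy(rng, eps=0.01):
    """One run of the epsilon-greedy baseline on the simulated restless bandit."""
    states = rng.integers(N_STATES, size=N_ARMS)
    means = np.zeros(N_ARMS)
    counts = np.zeros(N_ARMS)
    total = 0.0
    for _ in range(T):
        # Explore with probability eps, otherwise exploit the empirical mean.
        arm = rng.integers(N_ARMS) if rng.random() < eps else int(np.argmax(means))
        s = states[arm]
        reward = rng.choice(REWARDS[s], p=REWARD_PROBS[s])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
        # All arms evolve (restless), whether pulled or not.
        for i in range(N_ARMS):
            states[i] = rng.choice(N_STATES, p=TRANS[i, states[i]])
    return total

rng = np.random.default_rng(0)
# The paper averages over 100 iterations; 5 runs here keep the sketch fast.
avg = np.mean([run_epsilon_greedy(rng) for _ in range(5)])
print(f"epsilon-greedy average cumulative reward: {avg:.1f}")
```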