Online Restless Bandits with Unobserved States

Authors: Bowen Jiang, Bo Jiang, Jian Li, Tao Lin, Xinbing Wang, Chenghu Zhou

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show through simulations that TSEETC outperforms existing algorithms in regret. We conduct the proof-of-concept experiments, and compare our policy with existing baseline algorithms. Our results show that TSEETC outperforms existing algorithms in regret and the regret order is consistent with our theoretical result.
Researcher Affiliation | Academia | (1) Shanghai Jiao Tong University, Shanghai, China; (2) SUNY Binghamton University, Binghamton, NY, USA; (3) Communication University of China, Beijing, China; (4) Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China.
Pseudocode | Yes | Algorithm 1: Posterior Update for R_i(s, ·) and P_i(s, ·). Algorithm 2: Thompson Sampling with Episodic Explore-Then-Commit (TSEETC).
Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the described methodology.
Open Datasets | No | The paper describes a simulated environment: 'We consider two arms and there are two hidden states (0 and 1) for each arm.' It does not use or provide access information for a pre-existing public dataset.
Dataset Splits | No | The paper describes a simulation setup ('The learning horizon T = 50000, and each algorithm runs 100 iterations.') but does not specify explicit training, validation, or test dataset splits, as would be typical for experiments on pre-existing datasets.
Hardware Specification | No | The paper mentions running 'simulations' and 'experiments' but does not specify any details about the hardware (e.g., CPU, GPU models, memory) used to conduct these.
Software Dependencies | No | The paper mentions various baseline algorithms (e.g., 'ϵ-greedy', 'Sliding-Window UCB', 'RUCB', 'Q-learning', 'SEEU') and cites papers related to them, but it does not specify the versions of any software libraries, programming languages, or environments used for implementation or experimentation.
Experiment Setup | Yes | The learning horizon T = 50000, and each algorithm runs 100 iterations. At state 1, the reward set is {10, 20} and the reward set is {−10, 10} at state 0. We initialize the algorithm with an uninformed Dirichlet prior on the unknown parameters. The baselines include ϵ-greedy (Lattimore & Szepesvári, 2020) with ϵ = 0.01 and Sliding-Window UCB (Garivier & Moulines, 2011) with a specified window size (equal to 50).
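
The pseudocode row above refers to Algorithm 1 (posterior update for R_i(s, ·) and P_i(s, ·)) and Algorithm 2 (TSEETC), but no code is released. A minimal sketch of the Dirichlet-count bookkeeping that such a posterior update relies on is given below; the class name, method signatures, and the simple hard-assignment update are assumptions for illustration, not the paper's exact procedure, which must additionally handle the fact that the states are unobserved.

```python
import numpy as np


class DirichletArmModel:
    """Hypothetical sketch: Dirichlet pseudo-counts for one arm's unknown
    2-state transition matrix P_i(s, .) and reward (emission) matrix R_i(s, .).
    This is a standard Dirichlet-multinomial update, not the paper's exact
    belief-based procedure for unobserved states."""

    def __init__(self, n_states=2, n_rewards=2, prior=1.0):
        # Uninformed (symmetric) Dirichlet prior on the unknown parameters.
        self.trans_counts = np.full((n_states, n_states), prior)
        self.emit_counts = np.full((n_states, n_rewards), prior)

    def update(self, s_prev, s_next, reward_idx):
        # Conjugate update: add one pseudo-count per attributed transition
        # (s_prev -> s_next) and per observed reward emission at s_next.
        self.trans_counts[s_prev, s_next] += 1
        self.emit_counts[s_next, reward_idx] += 1

    def sample_parameters(self, rng):
        # Thompson-sampling step: draw P_i and R_i row-wise from the posterior.
        P = np.stack([rng.dirichlet(row) for row in self.trans_counts])
        R = np.stack([rng.dirichlet(row) for row in self.emit_counts])
        return P, R
```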
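
Similarly, the experiment setup row can only be approximated. The sketch below wires the quoted details (two arms with two hidden states each, reward sets {10, 20} at state 1 and {−10, 10} at state 0, T = 50000, ϵ-greedy with ϵ = 0.01) into a runnable toy simulation; the transition matrices, the uniform reward draw within each set, and the empirical-mean ϵ-greedy rule are illustrative assumptions, since the quoted text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed placeholders: the quoted setup does not list the hidden-state
# transition matrices, so these 2x2 matrices are illustrative only.
TRANSITIONS = [np.array([[0.7, 0.3], [0.4, 0.6]]),
               np.array([[0.5, 0.5], [0.2, 0.8]])]
REWARD_SETS = {0: [-10, 10], 1: [10, 20]}  # per the setup quoted above
T, EPS = 50_000, 0.01                      # horizon and epsilon-greedy parameter

states = [0, 0]       # hidden state of each of the two arms
means = np.zeros(2)   # empirical mean reward per arm
pulls = np.zeros(2)   # number of pulls per arm

for t in range(T):
    # Restless dynamics: every arm evolves whether or not it is pulled.
    states = [int(rng.choice(2, p=TRANSITIONS[i][s])) for i, s in enumerate(states)]

    # epsilon-greedy baseline: explore with probability EPS, else exploit.
    if rng.random() < EPS or pulls.min() == 0:
        arm = int(rng.integers(2))
    else:
        arm = int(np.argmax(means))

    reward = rng.choice(REWARD_SETS[states[arm]])      # noisy view of the hidden state
    pulls[arm] += 1
    means[arm] += (reward - means[arm]) / pulls[arm]   # incremental mean update
```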