reproducibilityindex.ai

Provably Efficient Q-Learning with Low Switching Cost

Authors: Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is O(H3SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects.
Researcher Affiliation	Academia	Yu Bai Stanford University yub@stanford.edu Tengyang Xie Nan Jiang UIUC {tx10, nanjiang}@illinois.edu Yu-Xiang Wang UC Santa Barbara yuxiangw@cs.ucsb.edu
Pseudocode	Yes	Algorithm 1 UCB2 for multi-armed bandits Initialize: rj = 0 for j = 1, . . . , A. Play each arm once. Set t 0 and T T A. while t T do Select arm j that maximizes rj + arj, where rj is the average reward obtained from arm j and ar = O( p log T/τ(r)) (with some speciﬁc choice.) Play arm j exactly τ(rj + 1) τ(rj) times. Set t t + τ(rj + 1) τ(rj) and rj rj + 1. end while
Open Source Code	No	The paper does not contain any statement about making the source code available, nor does it provide a link to a code repository.
Open Datasets	No	The paper is theoretical and focuses on algorithm design and proofs for Reinforcement Learning within episodic tabular MDPs. It does not involve experimental evaluation on a dataset, thus there is no mention of a publicly available dataset for training.
Dataset Splits	No	The paper is theoretical and focuses on algorithm design and proofs. It does not describe empirical experiments requiring dataset splits for training, validation, or testing.
Hardware Specification	No	The paper is theoretical and does not describe running empirical experiments, thus no hardware specifications are mentioned.
Software Dependencies	No	The paper is theoretical and does not describe running empirical experiments. Therefore, it does not list specific software dependencies with version numbers.
Experiment Setup	No	The paper is theoretical and focuses on algorithm design and proofs, rather than empirical evaluation. Therefore, it does not provide details on experimental setup, hyperparameters, or training configurations.