Provably Efficient Q-Learning with Low Switching Cost

Authors: Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang

NeurIPS 2019

Reproducibility Variable Result LLM Response
Research Type Theoretical Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is O(H^3 SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects.
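To make the regret-bound notion concrete, here is a minimal Python sketch of the local switching cost metric, assuming the paper's definition as I understand it: the number of (step, state) pairs whose chosen action changes between consecutive episodes' policies. The function name and policy representation (a dict keyed by (h, s)) are illustrative choices, not from the paper.

```python
def local_switching_cost(policies):
    """Local switching cost of a sequence of deterministic policies.

    Each policy is a dict mapping (h, s) -> action, over H steps and S states.
    Assumed definition: sum, over consecutive episodes, of the number of
    (h, s) pairs where the prescribed action changes.
    """
    cost = 0
    for prev, curr in zip(policies, policies[1:]):
        cost += sum(1 for key in prev if prev[key] != curr[key])
    return cost
```

Under this definition, an algorithm that rarely updates its policy (as UCB2-style epoching encourages) incurs cost that grows only logarithmically in the number of episodes K, matching the O(H^3 SA log K) bound quoted above.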
Researcher Affiliation Academia Yu Bai (Stanford University, yub@stanford.edu); Tengyang Xie and Nan Jiang (UIUC, {tx10, nanjiang}@illinois.edu); Yu-Xiang Wang (UC Santa Barbara, yuxiangw@cs.ucsb.edu)
Pseudocode Yes Algorithm 1 UCB2 for multi-armed bandits
Initialize: r_j = 0 for j = 1, . . . , A. Play each arm once. Set t ← 0 and T ← T − A.
while t ≤ T do
    Select the arm j that maximizes r̄_j + a_{r_j}, where r̄_j is the average reward obtained from arm j and a_r = O(√(log T / τ(r))) (with some specific choice).
    Play arm j exactly τ(r_j + 1) − τ(r_j) times.
    Set t ← t + τ(r_j + 1) − τ(r_j) and r_j ← r_j + 1.
end while
Open Source Code No The paper does not contain any statement about making the source code available, nor does it provide a link to a code repository.
Open Datasets No The paper is theoretical and focuses on algorithm design and proofs for Reinforcement Learning within episodic tabular MDPs. It does not involve experimental evaluation on a dataset, thus there is no mention of a publicly available dataset for training.
Dataset Splits No The paper is theoretical and focuses on algorithm design and proofs. It does not describe empirical experiments requiring dataset splits for training, validation, or testing.
Hardware Specification No The paper is theoretical and does not describe running empirical experiments, thus no hardware specifications are mentioned.
Software Dependencies No The paper is theoretical and does not describe running empirical experiments. Therefore, it does not list specific software dependencies with version numbers.
Experiment Setup No The paper is theoretical and focuses on algorithm design and proofs, rather than empirical evaluation. Therefore, it does not provide details on experimental setup, hyperparameters, or training configurations.