Provably Efficient Q-Learning with Low Switching Cost
Authors: Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is O(H³SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects. |
| Researcher Affiliation | Academia | Yu Bai (Stanford University, yub@stanford.edu); Tengyang Xie and Nan Jiang (UIUC, {tx10, nanjiang}@illinois.edu); Yu-Xiang Wang (UC Santa Barbara, yuxiangw@cs.ucsb.edu) |
| Pseudocode | Yes | Algorithm 1 (UCB2 for multi-armed bandits). Initialize: r_j = 0 for j = 1, ..., A. Play each arm once. Set t ← 0 and T ← T − A. While t ≤ T: select the arm j that maximizes r̄_j + a_{r_j}, where r̄_j is the average reward obtained from arm j and a_r = O(√(log T / τ(r))) (with some specific choice); play arm j exactly τ(r_j + 1) − τ(r_j) times; set t ← t + τ(r_j + 1) − τ(r_j) and r_j ← r_j + 1. End while. (An illustrative sketch of this procedure appears after the table.) |
| Open Source Code | No | The paper does not contain any statement about making the source code available, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm design and proofs for reinforcement learning in episodic tabular MDPs. It does not involve experimental evaluation on a dataset, so it makes no mention of a publicly available dataset used for training. |
| Dataset Splits | No | The paper is theoretical and focuses on algorithm design and proofs. It does not describe empirical experiments requiring dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not describe running empirical experiments, thus no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe running empirical experiments. Therefore, it does not list specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and proofs, rather than empirical evaluation. Therefore, it does not provide details on experimental setup, hyperparameters, or training configurations. |
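
The UCB2 pseudocode quoted in the table batches pulls of a selected arm into geometrically growing epochs, which is what keeps the number of policy switches small. Below is a minimal, hypothetical Python sketch of that bandit subroutine only (not the paper's Q-learning algorithm for episodic MDPs). The epoch schedule τ(r) = ⌈(1 + α)^r⌉ and the bonus a_r = √(2 log T / τ(r)) are assumed concrete choices standing in for the paper's "O(√(log T / τ(r))) (with some specific choice)".

```python
# Hypothetical sketch of Algorithm 1 (UCB2 for multi-armed bandits).
# The schedule tau(r) and the bonus a_r below are illustrative choices,
# not the exact constants used in the paper.
import math
import random


def tau(r, alpha=0.5):
    """Epoch-length schedule: tau(r) = ceil((1 + alpha)^r)."""
    return math.ceil((1.0 + alpha) ** r)


def ucb2(pull_arm, num_arms, horizon, alpha=0.5):
    """Run UCB2; `pull_arm(j)` returns a reward in [0, 1] for arm j."""
    totals = [0.0] * num_arms   # cumulative reward per arm
    counts = [0] * num_arms     # number of pulls per arm
    epochs = [0] * num_arms     # r_j: epochs completed for arm j

    # Play each arm once.
    for j in range(num_arms):
        totals[j] += pull_arm(j)
        counts[j] += 1

    t, T = 0, horizon - num_arms
    while t <= T:
        # Select the arm maximizing empirical mean + bonus a_{r_j},
        # with the assumed bonus a_r = sqrt(2 * log(T) / tau(r)).
        def index(j):
            mean = totals[j] / counts[j]
            bonus = math.sqrt(2.0 * math.log(max(horizon, 2)) / tau(epochs[j], alpha))
            return mean + bonus

        j = max(range(num_arms), key=index)

        # Play arm j for one whole epoch: tau(r_j + 1) - tau(r_j) pulls
        # (at least one, to guarantee progress).
        n_plays = max(1, tau(epochs[j] + 1, alpha) - tau(epochs[j], alpha))
        for _ in range(n_plays):
            totals[j] += pull_arm(j)
            counts[j] += 1

        t += n_plays
        epochs[j] += 1

    return totals, counts


if __name__ == "__main__":
    # Toy usage: three Bernoulli arms with means 0.2, 0.5, 0.8.
    means = [0.2, 0.5, 0.8]
    totals, counts = ucb2(lambda j: float(random.random() < means[j]),
                          num_arms=3, horizon=10_000)
    print(counts)  # most pulls should concentrate on the last arm
```

Because each selected arm is then played for a full epoch, each arm completes only on the order of log T epochs, so the total number of arm switches over T rounds is on the order of A log T; this batching is the property that the paper's UCB2-style exploration carries over to the episodic MDP setting to obtain its low local switching cost.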