Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Provably Efficient Q-Learning with Low Switching Cost
Authors: Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDP that achieves sublinear regret whose local switching cost in K episodes is O(H3SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting [13], which yields nontrivial results that improve upon prior work in certain aspects. |
| Researcher Affiliation | Academia | Yu Bai Stanford University EMAIL Tengyang Xie Nan Jiang UIUC EMAIL Yu-Xiang Wang UC Santa Barbara EMAIL |
| Pseudocode | Yes | Algorithm 1 UCB2 for multi-armed bandits Initialize: rj = 0 for j = 1, . . . , A. Play each arm once. Set t 0 and T T A. while t T do Select arm j that maximizes rj + arj, where rj is the average reward obtained from arm j and ar = O( p log T/τ(r)) (with some specific choice.) Play arm j exactly τ(rj + 1) τ(rj) times. Set t t + τ(rj + 1) τ(rj) and rj rj + 1. end while |
| Open Source Code | No | The paper does not contain any statement about making the source code available, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm design and proofs for Reinforcement Learning within episodic tabular MDPs. It does not involve experimental evaluation on a dataset, thus there is no mention of a publicly available dataset for training. |
| Dataset Splits | No | The paper is theoretical and focuses on algorithm design and proofs. It does not describe empirical experiments requiring dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not describe running empirical experiments, thus no hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe running empirical experiments. Therefore, it does not list specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and proofs, rather than empirical evaluation. Therefore, it does not provide details on experimental setup, hyperparameters, or training configurations. |