Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

Authors: Dan Qiao, Ming Yin, Ming Min, Yu-Xiang Wang

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of Õ(√(H^4 S^2 A T)) while requiring a switching cost of O(HSA log log T). This is an exponential improvement over the best-known switching cost O(H^2 SA log T) among existing methods with Õ(poly(H, S, A)·√T) regret. In the above, S and A denote the number of states and actions in an H-horizon episodic Markov Decision Process model with unknown transitions, and T is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of O(HSA). Furthermore, we prove a pair of information-theoretical lower bounds which say that (1) any no-regret algorithm must have a switching cost of Ω(HSA); (2) any Õ(√T)-regret algorithm must incur a switching cost of Ω(HSA log log T). Both our algorithms are thus optimal in their switching costs.
Researcher Affiliation | Academia | Department of Computer Science, UC Santa Barbara; Department of Statistics and Applied Probability, UC Santa Barbara.
Pseudocode | Yes | Algorithm 1: Adaptive Policy Elimination by Value Estimation (APEVE)
Open Source Code | No | The paper does not provide any links or explicit statements about the availability of open-source code.
Open Datasets | No | The paper is theoretical and does not describe experiments using datasets.
Dataset Splits | No | The paper is theoretical and does not describe experiments with dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not describe specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe specific experimental setup details such as hyperparameters.
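The O(HSA log log T) switching cost quoted above comes from the stage-wise structure of APEVE: the T steps are covered by only O(log log T) stages, and the deployed policy changes a bounded number of times per stage. The minimal Python sketch below illustrates only the log log T scaling, assuming the standard doubling-exponent schedule with stage lengths of roughly T^(1 - 2^-k); the exact stage lengths and per-stage policy updates of APEVE are defined in the paper and may differ.

import math

# Illustrative only: a doubling-exponent stage schedule (lengths ~ T^(1 - 2^-k)).
# This is the generic device behind log log T batching; it is not a
# reproduction of APEVE's exact schedule or its policy-elimination steps.
def stage_lengths(T: int) -> list[int]:
    lengths, covered, k = [], 0, 1
    while covered < T:
        # The exponent 1 - 2^-k halves its distance to 1 at every stage,
        # so roughly log2(log2(T)) stages suffice to cover all T steps.
        L = min(math.ceil(T ** (1 - 2.0 ** (-k))), T - covered)
        lengths.append(L)
        covered += L
        k += 1
    return lengths

if __name__ == "__main__":
    for T in (10**4, 10**6, 10**8):
        stages = stage_lengths(T)
        print(f"T = {T:>9}: {len(stages)} stages, log2(log2(T)) ≈ {math.log2(math.log2(T)):.1f}")

Running the sketch shows the stage count staying within a small constant of log2(log2(T)) as T grows by orders of magnitude, which is the source of the exponential improvement over log T switching-cost schedules.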