reproducibilityindex.ai

Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

Authors: Dan Qiao, Ming Yin, Ming Min, Yu-Xiang Wang

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of e O( H4S2AT) while requiring a switching cost of O(HSA log log T). This is an exponential improvement over the best-known switching cost O(H2SA log T) among existing methods with e O(poly(H, S, A) T) regret. In the above, S, A denotes the number of states and actions in an H-horizon episodic Markov Decision Process model with unknown transitions, and T is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of O(HSA). Furthermore, we prove a pair of information-theoretical lower bounds which say that (1) Any no-regret algorithm must have a switching cost of Ω(HSA); (2) Any e O( T) regret algorithm must incur a switching cost of Ω(HSA log log T). Both our algorithms are thus optimal in their switching costs.
Researcher Affiliation	Academia	1Department of Computer Science, UC Santa Barbara 2Department of Statistics and Applied Probability, UC Santa Barbara.
Pseudocode	Yes	Algorithm 1 Adaptive Policy Elimination by Value Estimation (APEVE)
Open Source Code	No	The paper does not provide any links or explicit statements about the availability of open-source code.
Open Datasets	No	The paper is theoretical and does not describe experiments using datasets.
Dataset Splits	No	The paper is theoretical and does not describe experiments with dataset splits.
Hardware Specification	No	The paper is theoretical and does not describe any hardware used for experiments.
Software Dependencies	No	The paper is theoretical and does not describe specific software dependencies with version numbers.
Experiment Setup	No	The paper is theoretical and does not describe specific experimental setup details such as hyperparameters.