Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost
Authors: Dan Qiao, Ming Yin, Ming Min, Yu-Xiang Wang
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of e O( H4S2AT) while requiring a switching cost of O(HSA log log T). This is an exponential improvement over the best-known switching cost O(H2SA log T) among existing methods with e O(poly(H, S, A) T) regret. In the above, S, A denotes the number of states and actions in an H-horizon episodic Markov Decision Process model with unknown transitions, and T is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of O(HSA). Furthermore, we prove a pair of information-theoretical lower bounds which say that (1) Any no-regret algorithm must have a switching cost of Ω(HSA); (2) Any e O( T) regret algorithm must incur a switching cost of Ω(HSA log log T). Both our algorithms are thus optimal in their switching costs. |
| Researcher Affiliation | Academia | 1Department of Computer Science, UC Santa Barbara 2Department of Statistics and Applied Probability, UC Santa Barbara. |
| Pseudocode | Yes | Algorithm 1 Adaptive Policy Elimination by Value Estimation (APEVE) |
| Open Source Code | No | The paper does not provide any links or explicit statements about the availability of open-source code. |
| Open Datasets | No | The paper is theoretical and does not describe experiments using datasets. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments with dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not describe specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe specific experimental setup details such as hyperparameters. |