Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Authors: Zihan Zhang, Yuan Zhou, Xiangyang Ji

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper we consider the problem of learning an ε-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with S states, A actions, the discount factor γ ∈ (0, 1), and an approximation threshold ε > 0, we provide a model-free algorithm to learn an ε-optimal policy with sample complexity O(SA ln(1/p) / (ε²(1−γ)^5.5)) and success probability (1 − p). For small enough ε, we show an improved algorithm with sample complexity O(SA ln(1/p) / (ε²(1−γ)³)). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on S, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
Researcher Affiliation | Academia | 1Tsinghua University, 2University of Illinois Urbana-Champaign. Correspondence to: Zihan Zhang <zihanzh17@mails.tsinghua.edu.cn>, Yuan Zhou <yuanz@illinois.edu>, Xiangyang Ji <xyji@tsinghua.edu.cn>.
Pseudocode | Yes | Algorithm 1: UCB-MULTISTAGE
Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is available.
Open Datasets | No | The paper is theoretical and does not involve training models on specific datasets. It deals with abstract MDPs with S states and A actions, and discusses sample complexity in a theoretical context.
Dataset Splits | No | The paper is theoretical and does not involve empirical validation on datasets, thus no dataset splits for training, validation, or testing are provided.
Hardware Specification | No | The paper is theoretical and does not describe any experimental setup that would require specific hardware. No hardware specifications are mentioned.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers. It refers to general RL algorithms and concepts but not concrete software for implementation or experimentation.
Experiment Setup | No | The paper is theoretical and does not involve empirical experiments with specific hyperparameters or system-level training settings.
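The abstract quoted in the table gives two sample-complexity bounds that differ only in their dependence on the effective horizon 1/(1−γ). As a rough illustration, the following Python sketch (not from the paper; the values of S, A, γ, ε, and p are arbitrary assumptions) evaluates both bounds with constants and logarithmic factors dropped.

```python
# Illustrative sketch only: compares the two sample-complexity bounds from the
# abstract, ignoring constants and logarithmic factors. All numeric values
# below are arbitrary assumptions chosen for illustration.
import math

def bound_general(S, A, eps, gamma, p):
    """First bound: O(S*A*ln(1/p) / (eps^2 * (1-gamma)^5.5))."""
    return S * A * math.log(1 / p) / (eps**2 * (1 - gamma)**5.5)

def bound_small_eps(S, A, eps, gamma, p):
    """Second bound (small eps regime): O(S*A*ln(1/p) / (eps^2 * (1-gamma)^3))."""
    return S * A * math.log(1 / p) / (eps**2 * (1 - gamma)**3)

if __name__ == "__main__":
    S, A, gamma, eps, p = 100, 10, 0.99, 0.1, 0.05
    print(f"general bound   ~ {bound_general(S, A, eps, gamma, p):.3e}")
    print(f"small-eps bound ~ {bound_small_eps(S, A, eps, gamma, p):.3e}")
```

For γ = 0.99 the two bounds differ by a factor of (1−γ)^{-2.5} ≈ 10^5, which is why the small-ε result matching the (1−γ)³ lower bound is the paper's headline claim.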