Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Authors: Zihan Zhang, Yuan Zhou, Xiangyang Ji

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper we consider the problem of learning an ε-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with S states, A actions, the discount factor γ ∈ (0, 1), and an approximation threshold ε > 0, we provide a model-free algorithm to learn an ε-optimal policy with sample complexity O(SA ln(1/p) / (ε²(1−γ)^5.5)) and success probability (1 − p). For small enough ε, we show an improved algorithm with sample complexity O(SA ln(1/p) / (ε²(1−γ)³)). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on S, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
Researcher Affiliation | Academia | 1Tsinghua University, 2University of Illinois Urbana-Champaign. Correspondence to: Zihan Zhang <zihanzh17@mails.tsinghua.edu.cn>, Yuan Zhou <yuanz@illinois.edu>, Xiangyang Ji <xyji@tsinghua.edu.cn>.
Pseudocode | Yes | Algorithm 1: UCB-MULTISTAGE
Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is available.
Open Datasets | No | The paper is theoretical and does not involve training models on specific datasets. It deals with abstract MDPs with S states and A actions, and discusses sample complexity in a theoretical context.
Dataset Splits | No | The paper is theoretical and does not involve empirical validation on datasets, thus no dataset splits for training, validation, or testing are provided.
Hardware Specification | No | The paper is theoretical and does not describe any experimental setup that would require specific hardware. No hardware specifications are mentioned.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers. It refers to general RL algorithms and concepts but not concrete software for implementation or experimentation.
Experiment Setup | No | The paper is theoretical and does not involve empirical experiments with specific hyperparameters or system-level training settings.
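The abstract quoted in the table gives two sample-complexity bounds that differ only in their dependence on the effective horizon 1/(1−γ). As a rough illustration, the following Python sketch (not from the paper; the values of S, A, γ, ε, and p are arbitrary assumptions) evaluates both bounds with constants and logarithmic factors dropped.

```python
# Illustrative sketch only: compares the two sample-complexity bounds from the
# abstract, ignoring constants and logarithmic factors. All numeric values
# below are arbitrary assumptions chosen for illustration.
import math

def bound_general(S, A, eps, gamma, p):
    """First bound: O(S*A*ln(1/p) / (eps^2 * (1-gamma)^5.5))."""
    return S * A * math.log(1 / p) / (eps**2 * (1 - gamma)**5.5)

def bound_small_eps(S, A, eps, gamma, p):
    """Second bound (small eps regime): O(S*A*ln(1/p) / (eps^2 * (1-gamma)^3))."""
    return S * A * math.log(1 / p) / (eps**2 * (1 - gamma)**3)

if __name__ == "__main__":
    S, A, gamma, eps, p = 100, 10, 0.99, 0.1, 0.05
    print(f"general bound   ~ {bound_general(S, A, eps, gamma, p):.3e}")
    print(f"small-eps bound ~ {bound_small_eps(S, A, eps, gamma, p):.3e}")
```

For γ = 0.99 the two bounds differ by a factor of (1−γ)^{-2.5} ≈ 10^5, which is why the small-ε result matching the (1−γ)³ lower bound is the paper's headline claim.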