Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
Authors: Zihan Zhang, Yuan Zhou, Xiangyang Ji
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0, 1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\tilde{O}\left(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}}\right)$ and success probability $(1 - p)$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}\left(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}}\right)$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors. (Both bounds are restated in display form after the table.) |
| Researcher Affiliation | Academia | ¹Tsinghua University, ²University of Illinois Urbana-Champaign. Correspondence to: Zihan Zhang <zihanzh17@mails.tsinghua.edu.cn>, Yuan Zhou <yuanz@illinois.edu>, Xiangyang Ji <xyji@tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 UCB-MULTISTAGE (an illustrative sketch of the model-free template it refines follows the table) |
| Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is available. |
| Open Datasets | No | The paper is theoretical and does not involve training models on specific datasets. It deals with abstract MDPs with S states and A actions, and discusses sample complexity in a theoretical context. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical validation on datasets, thus no dataset splits for training, validation, or testing are provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental setup that would require specific hardware. No hardware specifications are mentioned. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. It refers to general RL algorithms and concepts but not concrete software for implementation or experimentation. |
| Experiment Setup | No | The paper is theoretical and does not involve empirical experiments with specific hyperparameters or system-level training settings. |
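
For readability, the two sample complexity bounds quoted in the Research Type row can be restated in display form. The lower bound shown in the comment is the standard minimax lower bound for discounted MDPs, included as context for the abstract's "matches the information theoretic lower bound" claim rather than as a statement from the review itself:

```latex
% Bounds quoted from the paper's abstract, restated for readability.
\[
  \underbrace{\tilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)}_{\text{first algorithm, all } \epsilon}
  \qquad\qquad
  \underbrace{\tilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)}_{\text{improved algorithm, small enough } \epsilon}
\]
% The second bound matches the information-theoretic lower bound
% \Omega\!\left( SA / (\epsilon^{2}(1-\gamma)^{3}) \right) up to logarithmic factors.
```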
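The Pseudocode row points to Algorithm 1 (UCB-MULTISTAGE), which the paper states but this review does not reproduce. As a point of reference only, below is a minimal sketch of the generic optimistic model-free Q-learning template that UCB-MULTISTAGE refines. The environment interface (`env.reset()`, `env.step(a)`), the learning-rate schedule, and the `bonus` term are illustrative assumptions; the paper's actual algorithm uses multi-stage update triggers that this sketch omits.

```python
import numpy as np

def optimistic_q_learning(env, S, A, gamma, p, num_samples):
    """Sketch of a UCB-style model-free update for a discounted MDP.

    NOT the paper's UCB-MULTISTAGE; it only illustrates the optimistic
    model-free template (optimistic init + count-based exploration bonus)
    that the paper's algorithm builds on. Assumes rewards lie in [0, 1]
    and that env exposes reset() -> state and step(a) -> (state, reward).
    """
    H = int(np.ceil(1.0 / (1.0 - gamma)))       # effective horizon
    q_max = 1.0 / (1.0 - gamma)                 # maximum possible discounted return
    Q = np.full((S, A), q_max)                  # optimistic initialization
    N = np.zeros((S, A), dtype=np.int64)        # visit counts

    s = env.reset()
    for _ in range(num_samples):
        a = int(np.argmax(Q[s]))                # act greedily w.r.t. optimistic Q
        s_next, r = env.step(a)
        N[s, a] += 1
        n = N[s, a]
        alpha = (H + 1) / (H + n)               # schedule in the spirit of Jin et al. (2018)
        # Illustrative count-based UCB bonus; the constant structure is an assumption.
        bonus = np.sqrt(H * np.log(S * A * n / p) / n) / (1.0 - gamma)
        target = r + gamma * np.max(Q[s_next]) + bonus
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * min(target, q_max)
        s = s_next
    return Q
```

Capping the update target at $1/(1-\gamma)$ keeps the optimistic estimates within the range of achievable discounted returns, which is what makes the optimism-based analysis go through in this family of algorithms.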