Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
Authors: Dongruo Zhou, Jiafan He, Quanquan Gu
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a novel algorithm which makes use of the feature mapping and obtains an Õ(d√T/(1−γ)²) regret, where d is the dimension of the feature space, T is the time horizon and γ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithm, the regret is lower bounded by Ω(d√T/(1−γ)^1.5). Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a (1−γ)^(−0.5) factor. |
| Researcher Affiliation | Academia | Dongruo Zhou 1 Jiafan He 1 Quanquan Gu 1 1Department of Computer Science, University of California, Los Angeles, CA 90095, USA. |
| Pseudocode | Yes | Algorithm 1 Upper-Confidence Linear Kernel Reinforcement Learning (UCLK) Algorithm 2 Extended Value Iteration: EVI(C, U) |
| Open Source Code | No | The paper does not provide a link to, or other concrete access to, source code for the described methodology. |
| Open Datasets | No | This paper focuses on theoretical analysis of a reinforcement learning algorithm for a defined class of MDPs and does not utilize a traditional public dataset for training or evaluation. |
| Dataset Splits | No | This paper is theoretical and does not conduct experiments involving dataset splits (training, validation, test). |
| Hardware Specification | No | The paper does not describe any specific hardware used for experiments, as it is a theoretical work. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, as it is a theoretical work and does not describe empirical implementations. |
| Experiment Setup | No | The paper is theoretical and does not describe any experimental setup details such as hyperparameters or training configurations. |
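For readability, the regret bounds quoted in the Research Type row above can be typeset as follows; this is a restatement of the abstract's claims, not a derivation:

```latex
% Upper bound achieved by the proposed UCLK algorithm:
\mathrm{Regret}(T) \le \tilde{O}\!\left(\frac{d\sqrt{T}}{(1-\gamma)^{2}}\right)

% Lower bound holding for any algorithm, via a constructed hard class of MDPs:
\mathrm{Regret}(T) \ge \Omega\!\left(\frac{d\sqrt{T}}{(1-\gamma)^{1.5}}\right)
```

Dividing the upper bound by the lower bound leaves a gap of (1−γ)^(−0.5), which is the near-optimality factor stated in the abstract.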