Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
Authors: Jiafan He, Dongruo Zhou, Quanquan Gu
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions provided that there exists a positive sub-optimality gap for the optimal actionvalue function. More specifically, under the linear MDP assumption (Jin et al., 2020), the LSVIUCB algorithm can achieve e O(d3H5/gapmin log(T))regret; and under the linear mixture MDP assumption (Ayoub et al., 2020), the UCRL-VTR algorithm can achieve e O(d2H5/gapmin log3(T)) regret, where d is the dimension of feature mapping, H is the length of episode, gapmin is the minimal sub-optimality gap, and e O hides all logarithmic terms except log(T). To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models. |
| Researcher Affiliation | Academia | Jiafan He 1 Dongruo Zhou 1 Quanquan Gu 1 1Department of Computer Science, University of California, Los Angeles, CA 90095, USA. |
| Pseudocode | Yes | Algorithm 1 Least Square Value-iteration with UCB (LSVIUCB) (Jin et al., 2020) and Algorithm 2 UCRL with Value-Targeted Model Estimation (UCRL-VTR) (Jia et al., 2020; Ayoub et al., 2020) |
| Open Source Code | No | No statement regarding the release or availability of open-source code for the methodology described in this paper was found. |
| Open Datasets | No | This paper focuses on theoretical analysis and does not use or reference any publicly available datasets for training or evaluation. |
| Dataset Splits | No | This paper focuses on theoretical analysis and does not report empirical experiments requiring dataset splits. |
| Hardware Specification | No | This paper focuses on theoretical analysis and does not report empirical experiments, thus no hardware specifications are mentioned. |
| Software Dependencies | No | This paper focuses on theoretical analysis and algorithm design. It does not report empirical experiments that would require specific software dependencies or versions. |
| Experiment Setup | No | This paper is theoretical in nature, focusing on algorithm analysis and proofs, and therefore does not include details about an experimental setup or hyperparameters. |