Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation
Authors: Pihe Hu, Yu Chen, Longbo Huang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping φ(s, a). Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB⁺, which achieves an Õ(Hd√T) regret bound where H is the episode length, d is the feature dimension, and T is the number of steps. LSVI-UCB⁺ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism of elliptical potentials, and refined analysis of the correction term. |
| Researcher Affiliation | Academia | 1Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China 2Department of Mathematical Sciences, Tsinghua University, Beijing, China. |
| Pseudocode | Yes | Algorithm 1 LSVI-UCB⁺ for Linear MDPs |
| Open Source Code | No | The paper does not contain any statements about making code publicly available or links to a code repository. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm design and regret bounds for linear MDPs. It does not mention the use of any specific publicly available datasets for training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical experiments with data splits. There is no mention of training, validation, or test dataset splits. |
| Hardware Specification | No | The paper describes a theoretical algorithm and its statistical properties. It does not mention any hardware specifications used for running experiments. |
| Software Dependencies | No | The paper is theoretical and focuses on algorithm design and proofs. It does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and provides algorithmic details and proofs for regret bounds. It does not include an experimental setup section with hyperparameters or training settings. |
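The abstract's core computational step, weighted ridge regression over feature vectors φ(s, a), can be sketched as follows. This is an illustrative reconstruction, not the paper's exact procedure: the function name `weighted_ridge`, the variance weights `sigmas`, and the bonus scaling `beta` are all hypothetical placeholders; the paper's actual weight and bonus definitions come from its Bernstein-type analysis.

```python
import numpy as np

def weighted_ridge(phis, ys, sigmas, lam=1.0):
    """Illustrative weighted ridge regression over features phi(s, a).

    phis:   (n, d) array of feature vectors
    ys:     (n,) array of regression targets
    sigmas: (n,) array of per-sample weights (variance proxies; hypothetical)
    lam:    ridge regularization strength
    Returns the estimate w and the Gram matrix Lambda.
    """
    n, d = phis.shape
    # Lambda = lam * I + sum_i phi_i phi_i^T / sigma_i^2
    Lam = lam * np.eye(d)
    b = np.zeros(d)
    for phi, y, sigma in zip(phis, ys, sigmas):
        Lam += np.outer(phi, phi) / sigma**2
        b += phi * y / sigma**2
    # w = Lambda^{-1} b, solved without forming the explicit inverse
    w = np.linalg.solve(Lam, b)
    return w, Lam

def exploration_bonus(phi, Lam, beta=1.0):
    """Elliptical-potential bonus beta * sqrt(phi^T Lambda^{-1} phi).

    beta is a stand-in for the Bernstein-type confidence radius
    derived in the paper's analysis.
    """
    return beta * np.sqrt(phi @ np.linalg.solve(Lam, phi))
```

With uniform weights and negligible regularization, the estimator reduces to ordinary least squares, which is a quick sanity check that the update is implemented correctly.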