Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Authors: Pihe Hu, Yu Chen, Longbo Huang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping ϕ(s, a). Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computationally efficient algorithm, LSVI-UCB+, which achieves an Õ(Hd√T) regret bound, where H is the episode length, d is the feature dimension, and T is the number of steps. LSVI-UCB+ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism of elliptical potentials, and refined analysis of the correction term. (A reconstruction of the weighted ridge regression step appears after this table.)
Researcher Affiliation | Academia | 1) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; 2) Department of Mathematical Sciences, Tsinghua University, Beijing, China.
Pseudocode | Yes | Algorithm 1: LSVI-UCB+ for Linear MDPs (an illustrative sketch of its core updates follows this table).
Open Source Code | No | The paper does not contain any statements about making code publicly available or links to a code repository.
Open Datasets | No | The paper is theoretical and focuses on algorithm design and regret bounds for linear MDPs. It does not mention the use of any specific publicly available datasets for training or evaluation.
Dataset Splits | No | The paper is theoretical and does not involve empirical experiments with data splits. There is no mention of training, validation, or test dataset splits.
Hardware Specification | No | The paper describes a theoretical algorithm and its statistical properties. It does not mention any hardware specifications used for running experiments.
Software Dependencies | No | The paper is theoretical and focuses on algorithm design and proofs. It does not specify any software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and provides algorithmic details and proofs for regret bounds. It does not include an experimental setup section with hyperparameters or training settings.
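
To make the "weighted ridge regression" named in the abstract concrete: in this line of work the per-step weight vector is estimated by a variance-weighted least-squares solve. The display below is a reconstruction from standard formulations in the linear-MDP literature, not a quote from the paper; in particular, the variance estimates σ̄ and the exact indexing are the paper's to define.

    \hat{w}_h^k \;=\; \arg\min_{w \in \mathbb{R}^d}\;
      \sum_{j < k} \frac{1}{\bar{\sigma}_{h,j}^{2}}
      \Bigl( \phi(s_h^j, a_h^j)^{\top} w \;-\; r_h^j \;-\; \hat{V}_{h+1}(s_{h+1}^j) \Bigr)^{2}
      \;+\; \lambda \lVert w \rVert_2^2

Setting every σ̄ to 1 recovers the unweighted ridge regression of the original LSVI-UCB of Jin et al. (2020), whose regret is Õ(√(d³H³T)); the variance weighting is what enables the sharper Õ(Hd√T) bound claimed above.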
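
As a companion to the Pseudocode row above, here is a minimal, illustrative NumPy sketch of the two computations at the core of an LSVI-UCB+-style algorithm: the weighted ridge solve and the optimistic Q-estimate with an elliptical-potential bonus. All names here (weighted_ridge, ucb_q, sigma2, beta) are hypothetical, and the Bernstein-type bonus radius beta and the variance weights sigma2 are taken as given inputs; deriving them is the substance of the paper's analysis.

    import numpy as np

    def weighted_ridge(Phi, y, sigma2, lam=1.0):
        """Minimize  sum_k (Phi[k] @ w - y[k])**2 / sigma2[k]  +  lam * ||w||^2.

        Phi    : (n, d) feature matrix, rows phi(s_k, a_k)
        y      : (n,) regression targets, e.g. r_k + V_next(s'_k)
        sigma2 : (n,) per-sample variance weights (sigma-bar squared)
        """
        W = Phi / sigma2[:, None]                      # rows phi_k / sigma_k^2
        Lam = W.T @ Phi + lam * np.eye(Phi.shape[1])   # weighted Gram matrix
        w = np.linalg.solve(Lam, W.T @ y)              # closed-form minimizer
        return w, Lam

    def ucb_q(phi, w, Lam, beta, H):
        """Optimistic Q-estimate: linear prediction plus beta * ||phi||_{Lam^-1},
        clipped to [0, H] as is standard for episodic MDPs."""
        Lam_inv = np.linalg.inv(Lam)
        bonus = beta * np.sqrt(np.einsum('nd,de,ne->n', phi, Lam_inv, phi))
        return np.clip(phi @ w + bonus, 0.0, H)

In a full loop, these would run backwards over steps h = H, ..., 1 in each episode, with the optimistic values at step h+1 supplying the regression targets y at step h.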