Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation

Authors: Jiafan He, Dongruo Zhou, Quanquan Gu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest. We propose new algorithms for both contextual linear bandits and linear Markov decision processes (MDPs) [22, 13]. Both of them are uniform-PAC, and their sample complexity is comparable to that of the state-of-the-art algorithms which are not uniform-PAC. Our key contributions are highlighted as follows.
Researcher Affiliation Academia Jiafan He Department of Computer Science University of California, Los Angeles CA 90095, USA jiafanhe19@ucla.edu Dongruo Zhou Department of Computer Science University of California, Los Angeles CA 90095, USA drzhou@cs.ucla.edu Quanquan Gu Department of Computer Science University of California, Los Angeles CA 90095, USA qgu@cs.ucla.edu
Pseudocode Yes Algorithm 1 Uniform-PAC OFUL (UPAC-OFUL) ... Algorithm 2 Uniform PAC Least-Square Value Iteration (FLUTE)
Open Source Code No The paper does not provide any explicit statement or link regarding the availability of open-source code for the described methodology.
Open Datasets No The paper focuses on theoretical guarantees and algorithm design for reinforcement learning with linear function approximation, and does not involve experiments with a specific dataset.
Dataset Splits No The paper does not conduct empirical experiments using datasets, therefore, there is no mention of validation splits.
Hardware Specification No The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies No The paper is theoretical and does not list any specific software dependencies or versions required for replication.
Experiment Setup No The paper focuses on theoretical algorithm design and analysis, and thus does not include details on experimental setup or hyperparameters.