Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation
Authors: Jiafan He, Dongruo Zhou, Quanquan Gu
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest. We propose new algorithms for both contextual linear bandits and linear Markov decision processes (MDPs) [22, 13]. Both of them are uniform-PAC, and their sample complexity is comparable to that of the state-of-the-art algorithms which are not uniform-PAC. Our key contributions are highlighted as follows. |
| Researcher Affiliation | Academia | Jiafan He Department of Computer Science University of California, Los Angeles CA 90095, USA jiafanhe19@ucla.edu Dongruo Zhou Department of Computer Science University of California, Los Angeles CA 90095, USA drzhou@cs.ucla.edu Quanquan Gu Department of Computer Science University of California, Los Angeles CA 90095, USA qgu@cs.ucla.edu |
| Pseudocode | Yes | Algorithm 1 Uniform-PAC OFUL (UPAC-OFUL) ... Algorithm 2 Uniform PAC Least-Square Value Iteration (FLUTE) |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper focuses on theoretical guarantees and algorithm design for reinforcement learning with linear function approximation, and does not involve experiments with a specific dataset. |
| Dataset Splits | No | The paper does not conduct empirical experiments using datasets, therefore, there is no mention of validation splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not list any specific software dependencies or versions required for replication. |
| Experiment Setup | No | The paper focuses on theoretical algorithm design and analysis, and thus does not include details on experimental setup or hyperparameters. |