Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning
Authors: Qiwei Di, Heyang Zhao, Jiafan He, Quanquan Gu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this section, we prove an instance-dependent regret bound of Algorithm 1. Our algorithmic design comprises three innovative components: (1) a variance-based weighted regression scheme that can be applied to a wide range of function classes, (2) a subroutine for variance estimation, and (3) a planning phase that utilizes a pessimistic value iteration approach. Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation. |
| Researcher Affiliation | Academia | Qiwei Di1, Heyang Zhao1, Jiafan He1, Quanquan Gu1. 1Department of Computer Science, University of California, Los Angeles. {qiwei2000,hyzhao,jiafanhe19,qgu}@cs.ucla.edu |
| Pseudocode | Yes | Algorithm 1 Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI); a hedged sketch of its structure is given below the table. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | No | The paper describes using a 'batch-dataset D' for offline RL, but does not name any specific public datasets or provide access information for any dataset used. |
| Dataset Splits | No | The paper is theoretical and does not report on experiments with dataset splits, so no training/validation/test splits are mentioned. |
| Hardware Specification | No | The paper is theoretical and does not report on experiments, thus no hardware specifications are provided. |
| Software Dependencies | No | The paper is theoretical and does not report on experiments, thus no software dependencies with version numbers are listed. |
| Experiment Setup | No | The paper focuses on theoretical algorithm design and analysis, and does not provide details about an experimental setup, such as hyperparameters or training settings. |
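The pseudocode referenced above combines the three components cited in the Research Type excerpt: a variance-based weighted regression scheme, a variance-estimation subroutine, and a pessimistic value-iteration planning phase. The sketch below illustrates that structure, specialized to linear function approximation for concreteness, since the paper works with a general nonlinear function class accessed through a regression oracle. The dataset layout, `feature_map`, `bonus_scale`, the ridge regularizer, and the residual-based variance estimate are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the PNLSVI structure (not the authors' code), specialized
# to linear function approximation so the weighted regression has a closed form.
import numpy as np

def pnlsvi_linear(dataset, feature_map, H, num_actions, d,
                  bonus_scale=1.0, sigma_min=1.0, reg=1.0):
    """dataset[h]: list of (s, a, r, s_next) transitions collected at step h."""
    V_next = lambda s: 0.0   # terminal value V_{H+1} = 0
    Q_funcs = [None] * H

    for h in reversed(range(H)):
        Phi = np.stack([feature_map(s, a) for (s, a, _, _) in dataset[h]])   # (n, d)
        # Regression targets: reward plus the estimated next-step value.
        y = np.array([r + V_next(s_next) for (_, _, r, s_next) in dataset[h]])

        # First-pass (unweighted) ridge regression, used only to estimate variances.
        w0 = np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ y)

        # Variance-estimation subroutine: squared residuals, floored at sigma_min**2
        # (an assumed stand-in for the paper's variance estimator).
        sigma2 = np.maximum((y - Phi @ w0) ** 2, sigma_min ** 2)

        # Variance-weighted least-squares regression.
        Phi_w = Phi / sigma2[:, None]
        Lambda = Phi_w.T @ Phi + reg * np.eye(d)
        w = np.linalg.solve(Lambda, Phi_w.T @ y)
        Lambda_inv = np.linalg.inv(Lambda)

        # Pessimistic planning: subtract an elliptical uncertainty bonus from the
        # regressed Q-value, then clip to the valid value range [0, H].
        def Q_h(s, a, w=w, Lambda_inv=Lambda_inv):
            phi = feature_map(s, a)
            bonus = bonus_scale * np.sqrt(phi @ Lambda_inv @ phi)
            return float(np.clip(phi @ w - bonus, 0.0, H))

        def V_h(s, Q_h=Q_h):
            return max(Q_h(s, a) for a in range(num_actions))

        Q_funcs[h], V_next = Q_h, V_h

    # Greedy policy with respect to the pessimistic Q-estimates at each step.
    policy = [lambda s, Q=Q: max(range(num_actions), key=lambda a: Q(s, a))
              for Q in Q_funcs]
    return policy, Q_funcs
```

The subtracted bonus is what implements pessimism: it pushes the value estimate below the regression fit in proportion to the (variance-weighted) uncertainty at each state-action pair, so that the greedy planning phase avoids actions the offline dataset does not support.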