Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
Authors: Ming Yin, Yu-Xiang Wang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data collected by a behavior policy µ. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works study this problem under different data-coverage assumptions, and their learning guarantees are expressed via covering coefficients that lack an explicit characterization of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive a suboptimality upper bound that nearly matches... As a complement, we also prove a per-instance information-theoretic lower bound under the weak assumption that d^µ_h(s_h, a_h) > 0 whenever d^π_h(s_h, a_h) > 0. Unlike previous minimax lower bounds, the per-instance lower bound (via local minimaxity) is a much stronger criterion, as it applies to individual instances separately. |
| Researcher Affiliation | Academia | Ming Yin¹·² and Yu-Xiang Wang¹ — ¹Department of Computer Science, UC Santa Barbara; ²Department of Statistics and Applied Probability, UC Santa Barbara |
| Pseudocode | Yes | Algorithm 1 Adaptive (assumption-free) Pessimistic Value Iteration or LCBVI-Bernstein |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm analysis and deriving bounds. It does not mention or use any specific publicly available datasets for training, nor does it provide access information for any dataset. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical experiments with data splits for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processors, or cloud computing specifications used for running experiments. It is a theoretical paper. |
| Software Dependencies | No | The paper does not list any specific software dependencies or their version numbers (e.g., programming languages, libraries, frameworks, or solvers). |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with concrete hyperparameter values, training configurations, or system-level settings for empirical evaluation. |
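The table notes that the paper's only algorithmic artifact is pseudocode ("Algorithm 1: Adaptive (assumption-free) Pessimistic Value Iteration or LCBVI-Bernstein") and that no source code is released. Purely as an illustration of the general technique, here is a minimal tabular sketch of finite-horizon value iteration with a Bernstein-style lower-confidence bonus subtracted during the backup. The function name, bonus constants, and data layout are our assumptions for this sketch, not the paper's exact algorithm:

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, H, n_states, n_actions, delta=0.05):
    """Illustrative LCB-style pessimistic value iteration (tabular, finite horizon).

    counts[h][s, a, s'] : visit counts of (s, a, s') at step h in the offline data
    rewards[h][s, a]    : known mean reward in [0, 1] (assumed given here)
    Subtracts a Bernstein-style bonus from the backed-up value, so the returned
    policy is greedy w.r.t. a lower confidence bound on the value.
    """
    V = np.zeros((H + 1, n_states))
    pi = np.zeros((H, n_states), dtype=int)
    log_term = np.log(2 * H * n_states * n_actions / delta)
    for h in range(H - 1, -1, -1):
        n_sa = counts[h].sum(axis=2)  # N_h(s, a)
        # Empirical transition model; uniform placeholder where (s, a) is unvisited.
        P_hat = np.where(n_sa[..., None] > 0,
                         counts[h] / np.maximum(n_sa, 1)[..., None],
                         1.0 / n_states)
        EV = P_hat @ V[h + 1]                    # E_{s' ~ P_hat}[V_{h+1}(s')]
        var = (P_hat @ V[h + 1] ** 2) - EV ** 2  # Var_{s' ~ P_hat}[V_{h+1}(s')]
        n_safe = np.maximum(n_sa, 1)
        # Bernstein-style bonus: variance-aware term plus a lower-order range term.
        bonus = np.sqrt(2 * np.maximum(var, 0) * log_term / n_safe) + H * log_term / n_safe
        Q = np.clip(rewards[h] + EV - bonus, 0.0, H)  # pessimistic (LCB) Q-values
        Q[n_sa == 0] = 0.0  # assumption-free flavor: unvisited pairs get zero value
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi, V
```

On well-covered state-action pairs the bonus shrinks at the usual 1/sqrt(N) rate, so the pessimistic values converge to the empirical backup; unvisited pairs are pinned to zero, mirroring the assumption-free treatment described for Algorithm 1.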