Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

Authors: Ming Yin, Yu-Xiang Wang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data collected by a behavior policy µ. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works study this problem under different data-coverage assumptions, and their learning guarantees are expressed in terms of covering coefficients, which lack an explicit characterization of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive a suboptimality upper bound that nearly matches... (the form of this bound is sketched after the table). As a complement, we also prove a per-instance information-theoretic lower bound under the weak assumption that $d^{\mu}_h(s_h, a_h) > 0$ whenever $d^{\pi^*}_h(s_h, a_h) > 0$. Unlike previous minimax lower bounds, the per-instance lower bound (via local minimaxity) is a much stronger criterion, as it applies to individual instances separately.
Researcher Affiliation | Academia | Ming Yin (1)(2) and Yu-Xiang Wang (1). (1) Department of Computer Science, UC Santa Barbara; (2) Department of Statistics and Applied Probability, UC Santa Barbara.
Pseudocode | Yes | Algorithm 1: Adaptive (assumption-free) Pessimistic Value Iteration (APVI), also referred to as LCBVI-Bernstein (an illustrative sketch follows the table).
Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and focuses on algorithm analysis and deriving bounds. It does not mention or use any specific publicly available datasets for training, nor does it provide access information for any dataset.
Dataset Splits | No | The paper is theoretical and does not involve empirical experiments with data splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not provide any specific hardware details such as GPU/CPU models, processors, or cloud-computing specifications used for running experiments.
Software Dependencies | No | The paper does not list any specific software dependencies or their version numbers (e.g., programming languages, libraries, frameworks, or solvers).
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with concrete hyperparameter values, training configurations, or system-level settings for empirical evaluation.
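
Two notes follow the table. First, on the bound truncated in the Research Type row: up to constants and logarithmic factors, the paper's instance-dependent upper bound has roughly the form below. This is a hedged reconstruction from the paper's full abstract, where $\pi^*$ is the optimal policy, $\mu$ the behavior policy, and $n$ the number of episodes; consult the paper for the exact statement and conditions.

```latex
% Hedged reconstruction of the APVI suboptimality bound; see the paper
% for the precise constants, conditions, and lower-order terms.
\[
  v^{\pi^*} - v^{\widehat{\pi}}
  \;\lesssim\;
  \sum_{h=1}^{H} \sum_{s_h,\, a_h}
    d^{\pi^*}_h(s_h, a_h)\,
    \sqrt{\frac{\operatorname{Var}_{P(\cdot \mid s_h, a_h)}\!\left(V^{*}_{h+1} + r_h\right)}
               {n \, d^{\mu}_h(s_h, a_h)}}
\]
```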
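
Second, on the Pseudocode row: the paper presents Algorithm 1 only as pseudocode. Below is a minimal Python sketch of pessimistic (lower-confidence-bound) value iteration with a Bernstein-style bonus on a tabular finite-horizon MDP. The function name `apvi`, the bonus constants, and the episode data format are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def apvi(data, S, A, H, delta=0.05):
    """Minimal sketch of pessimistic value iteration with a Bernstein-style
    bonus (illustrative only; not the authors' exact Algorithm 1).

    data: list of episodes, each a list of H tuples (s, a, r, s_next),
          with states/actions encoded as integers and rewards in [0, 1].
    Returns a deterministic policy pi[h, s] -> action.
    """
    log_term = np.log(2 * H * S * A / delta)

    # Empirical counts, rewards, and transition model, per time step h.
    count = np.zeros((H, S, A))
    r_sum = np.zeros((H, S, A))
    p_count = np.zeros((H, S, A, S))
    for episode in data:
        for h, (s, a, r, s_next) in enumerate(episode):
            count[h, s, a] += 1
            r_sum[h, s, a] += r
            p_count[h, s, a, s_next] += 1

    V = np.zeros((H + 1, S))          # pessimistic value estimates
    pi = np.zeros((H, S), dtype=int)  # greedy policy w.r.t. pessimistic Q

    for h in range(H - 1, -1, -1):    # backward induction over the horizon
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                m = count[h, s, a]
                if m == 0:
                    continue  # unvisited pairs keep the pessimistic value 0
                p_hat = p_count[h, s, a] / m
                r_hat = r_sum[h, s, a] / m
                # Empirical variance of the next-step value under p_hat.
                var = max(p_hat @ (V[h + 1] ** 2) - (p_hat @ V[h + 1]) ** 2, 0.0)
                # Bernstein-style bonus: variance term plus a lower-order term.
                bonus = np.sqrt(2 * var * log_term / m) + 7 * H * log_term / (3 * m)
                # Pessimism: subtract the bonus; clip to the feasible range.
                Q[s, a] = np.clip(r_hat + p_hat @ V[h + 1] - bonus, 0.0, H - h)
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi
```

A call such as `pi = apvi(episodes, S=5, A=2, H=3)` on synthetic episodes returns a step-indexed deterministic policy. The key design choice, in line with the paper's theme, is that the confidence bonus is subtracted (pessimism) rather than added as in optimistic online RL.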