Worst-Case Offline Reinforcement Learning with Arbitrary Data Support

Authors: Kohei Miyaguchi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose a method of offline reinforcement learning (RL) featuring a performance guarantee without any assumptions on the data support. Under such conditions, estimating or optimizing the conventional performance metric is generally infeasible due to the distributional discrepancy between the data and target policy distributions. To address this issue, we employ a worst-case policy value as a new metric and constructively show that a sample complexity bound of O(ϵ⁻²) is attainable without any data-support conditions, where ϵ > 0 is the policy suboptimality in the new metric. Moreover, as the new metric generalizes the conventional one, the algorithm can address standard offline RL tasks without modification. In this context, our sample complexity bound can be seen as a strict improvement on previous bounds under single-policy concentrability and single-policy realizability.
Researcher Affiliation | Industry | Kohei Miyaguchi, IBM Research Tokyo, Tokyo, Japan (koheimiyaguchi@gmail.com). The author was affiliated with LY Corporation at the time of publication.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks; the methods are described mathematically.
Open Source Code | No | The paper is theoretical and does not mention providing open-source code for the methodology it describes.
Open Datasets | No | The paper is theoretical and does not report empirical experiments on specific datasets. It refers to a conceptual "offline dataset D" but provides no access information or citation for a publicly available dataset.
Dataset Splits | No | The paper is theoretical and does not report empirical experiments, so no training/validation/test dataset splits are specified.
Hardware Specification | No | The paper is purely theoretical and does not describe any hardware used for running experiments.
Software Dependencies | No | The paper is purely theoretical and does not list any software dependencies with version numbers.
Experiment Setup | No | The paper is purely theoretical and does not provide details of an experimental setup, such as hyperparameters or training settings.
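For reference, the sample-complexity claim quoted in the Research Type row can be sketched as follows. This is a hedged paraphrase, not the paper's precise theorem: the symbol V_wc for the worst-case policy value and the exact logarithmic and problem-dependent factors are notation introduced here for illustration.

```latex
% Hedged restatement of the abstract's claim (V_{\mathrm{wc}} is illustrative
% notation for the worst-case policy value; see the paper for the precise
% statement and constants). With N = O(\epsilon^{-2}) samples, up to
% problem-dependent and logarithmic factors, the returned policy \hat\pi is
% \epsilon-suboptimal in the worst-case metric, with no data-support conditions:
\sup_{\pi} V_{\mathrm{wc}}(\pi) \;-\; V_{\mathrm{wc}}(\hat\pi) \;\le\; \epsilon,
\qquad N = O\!\left(\epsilon^{-2}\right).
```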