Nearly Horizon-Free Offline Reinforcement Learning
Authors: Tongzheng Ren, Jialian Li, Bo Dai, Simon S. Du, Sujay Sanghavi
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | For offline policy evaluation, we obtain an O error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on poly(H, S, A, d) in higher-order term. For offline policy optimization, we obtain an O suboptimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a high-order term, improving upon the best known result by [1] that has additional poly(H, S, d) factors in the main term. To the best of our knowledge, these are the first set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDP and linear MDP with anchor points. |
| Researcher Affiliation | Collaboration | Tongzheng Ren (UT Austin), Jialian Li (Tsinghua University), Bo Dai (Google Research, Brain Team), Simon S. Du (University of Washington), Sujay Sanghavi (UT Austin; Amazon Search) |
| Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not include any statement about releasing source code or provide any links to a code repository. |
| Open Datasets | No | The paper is theoretical and focuses on sample complexity bounds and theoretical guarantees. It discusses "collected K episodes data" as input for offline reinforcement learning, but it does not specify a publicly available or open dataset used for empirical training of a model, nor does it provide access information for such a dataset. |
| Dataset Splits | No | The paper is theoretical and does not describe experiments that would require dataset splits. Therefore, it does not provide information about training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe empirical experiments, thus it does not provide any hardware specifications used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not describe empirical experiments. Therefore, it does not list any specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe empirical experiments. Therefore, it does not provide specific experimental setup details, hyperparameters, or training configurations. |