Near-Optimal Offline Reinforcement Learning via Double Variance Reduction
Authors: Ming Yin, Yu Bai, Yu-Xiang Wang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an ϵ-optimal policy with Õ(H²/dₘϵ²) episodes of offline data in the finite-horizon stationary transition setting... Moreover, we establish an information-theoretic lower bound of Ω(H²/dₘϵ²) which certifies that OPDVR is optimal up to logarithmic factors. *(These bounds are restated below the table.)* |
| Researcher Affiliation | Collaboration | Ming Yin¹,³, Yu Bai², and Yu-Xiang Wang¹. ¹Department of Computer Science, UC Santa Barbara; ²Salesforce Research; ³Department of Statistics and Applied Probability, UC Santa Barbara |
| Pseudocode | Yes | Algorithm 1 (OPVRT): A Prototypical Off-Policy Variance Reduction Template; Algorithm 2 (OPDVR): Off-Policy Doubled Variance Reduction. *(A hedged sketch of this style of update appears after the table.)* |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper refers to using a 'static offline dataset D' obtained by executing a 'pre-specified behavior policy µ', but does not name a publicly available dataset or provide any access information (link, DOI, specific citation with authors/year) for a dataset used for training. |
| Dataset Splits | No | The paper does not provide specific information regarding training, validation, or test dataset splits. It is a theoretical paper focusing on algorithms and sample complexity. |
| Hardware Specification | No | The paper does not mention any specific hardware used for running experiments. It is a theoretical paper. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. It focuses on theoretical algorithms and proofs. |
| Experiment Setup | No | The paper is theoretical and does not provide details about an experimental setup, such as hyperparameters or specific training configurations. |
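
For reference, the sample-complexity claims quoted in the Research Type row can be written out cleanly. Following the paper's notation, dₘ denotes the minimal marginal state-action occupancy induced by the behavior policy, H is the horizon, and n is the number of offline episodes:

```latex
% OPDVR upper bound (finite-horizon, stationary transitions):
% n episodes suffice to identify an \epsilon-optimal policy.
n = \widetilde{O}\!\left(\frac{H^{2}}{d_m \epsilon^{2}}\right)

% Matching information-theoretic lower bound
% (so OPDVR is optimal up to logarithmic factors):
n = \Omega\!\left(\frac{H^{2}}{d_m \epsilon^{2}}\right)
```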
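
Since no code accompanies the paper (see the Open Source Code row), the sketch below illustrates the general SVRG-style "reference plus correction" variance-reduction pattern that algorithms of this family build on. It is a hypothetical, simplified illustration: it uses a discounted, infinite-horizon tabular setup rather than the paper's finite-horizon formulation, and all function names and hyperparameters here are invented for the example; it is not the authors' OPDVR.

```python
import numpy as np

def bellman_backup(Q, batch, gamma):
    """Empirical Bellman optimality backup of Q estimated from a
    batch of (s, a, r, s_next) transitions; returns the per-(s, a)
    average target and a mask of the pairs observed in the batch."""
    target = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    for s, a, r, s_next in batch:
        target[s, a] += r + gamma * Q[s_next].max()
        counts[s, a] += 1
    return target / np.maximum(counts, 1), counts > 0

def variance_reduced_q_iteration(dataset, n_states, n_actions,
                                 gamma=0.99, outer_iters=10,
                                 inner_iters=20, batch_size=64,
                                 seed=0):
    """SVRG-style variance-reduced Q-iteration sketch (hypothetical;
    NOT the paper's OPDVR). An accurate reference backup computed on
    the full offline dataset anchors noisy mini-batch updates via the
    control variate  T_hat(Q) - T_hat(Q_ref) + T_full(Q_ref)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(outer_iters):
        Q_ref = Q.copy()
        # Accurate "anchor" backup from the whole offline dataset.
        ref_backup, _ = bellman_backup(Q_ref, dataset, gamma)
        for _ in range(inner_iters):
            idx = rng.choice(len(dataset),
                             size=min(batch_size, len(dataset)),
                             replace=False)
            batch = [dataset[i] for i in idx]
            tq, seen = bellman_backup(Q, batch, gamma)
            tq_ref, _ = bellman_backup(Q_ref, batch, gamma)
            # Mini-batch noise in tq and tq_ref largely cancels,
            # leaving the low-variance reference term to dominate.
            Q = np.where(seen, tq - tq_ref + ref_backup, Q)
    return Q
```

As its name suggests, the paper's OPDVR applies such a variance-reduction stage twice (the "doubled" in Algorithm 2); the sketch above shows only a single stage in the spirit of the prototypical template named as Algorithm 1 (OPVRT).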