Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Authors: Ming Yin, Yu Bai, Yu-Xiang Wang

NeurIPS 2021

Reproducibility assessment: each item below gives the variable, the result, and the LLM response supporting it.
Research Type: Theoretical. "In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an ϵ-optimal policy with Õ(H^2/(d_m ϵ^2)) episodes of offline data in the finite-horizon stationary transition setting... Moreover, we establish an information-theoretic lower bound of Ω(H^2/(d_m ϵ^2)) which certifies that OPDVR is optimal up to logarithmic factors." (A LaTeX rendering of these bounds appears below the assessment items.)
Researcher Affiliation: Collaboration. Ming Yin (1,3), Yu Bai (2), and Yu-Xiang Wang (1). (1) Department of Computer Science, UC Santa Barbara; (2) Salesforce Research; (3) Department of Statistics and Applied Probability, UC Santa Barbara.
Pseudocode: Yes. Algorithm 1 (OPVRT): A Prototypical Off-Policy Variance Reduction Template; Algorithm 2 (OPDVR): Off-Policy Doubled Variance Reduction. (An illustrative sketch of a variance-reduced backup appears below the assessment items.)
Open Source Code: No. The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets: No. The paper refers to using a 'static offline dataset D' obtained by executing a 'pre-specified behavior policy µ', but does not name a publicly available dataset or provide any access information (link, DOI, or specific citation) for a dataset used for training.
Dataset Splits: No. The paper does not provide training, validation, or test dataset splits; it is a theoretical paper focusing on algorithms and sample complexity.
Hardware Specification: No. The paper does not mention any specific hardware used for running experiments; it is a theoretical paper.
Software Dependencies: No. The paper does not list software dependencies with version numbers; it focuses on theoretical algorithms and proofs.
Experiment Setup: No. The paper is theoretical and does not provide details of an experimental setup, such as hyperparameters or training configurations.
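For readability, here is a LaTeX rendering of the sample-complexity claims quoted in the Research Type row. The notation follows the abstract: H is the horizon, ϵ the target accuracy, and d_m the minimal marginal state-action occupancy of the behavior policy (our reading of the paper's notation; the precise definition is in the paper).

```latex
% Upper bound: number of offline episodes OPDVR needs to return an
% \epsilon-optimal policy in the finite-horizon, stationary-transition setting.
\[
  n_{\text{OPDVR}} \;=\; \widetilde{O}\!\left(\frac{H^2}{d_m\,\epsilon^2}\right)
\]

% Matching information-theoretic lower bound, certifying that OPDVR
% is optimal up to logarithmic factors.
\[
  n \;\geq\; \Omega\!\left(\frac{H^2}{d_m\,\epsilon^2}\right)
\]
```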
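The paper's Algorithms 1 and 2 are not reproduced here. As a rough illustration of the variance-reduction idea the Pseudocode row points to, below is a minimal Python sketch of a single SVRG-style variance-reduced Bellman backup. Every name (vr_bellman_backup, v_ref, the batch arguments) is hypothetical; this sketches the general technique under our stated assumptions, not the authors' algorithm verbatim.

```python
import numpy as np

def vr_bellman_backup(reward, next_big, next_small, v, v_ref, gamma=1.0):
    """One variance-reduced Bellman backup for a fixed (s, a) pair.

    SVRG-style split: the reference value P V_ref is estimated on a large
    batch of sampled next states (next_big), while the correction
    P (V - V_ref) is estimated on a small batch (next_small).  Once V is
    close to the reference, V - V_ref is small, so the correction term has
    low variance; that is the source of the sample savings.
    """
    anchor = np.mean(v_ref[next_big])                        # low-variance reference term
    correction = np.mean(v[next_small] - v_ref[next_small])  # cheap refinement
    return reward + gamma * (anchor + correction)

# Hypothetical usage: 6 states, next-state indices sampled from an
# offline dataset generated by some behavior policy.
v_ref = np.array([0.0, 0.5, 1.0, 0.2, 0.8, 0.3])
v = v_ref + 0.05                                 # current estimate, near the reference
big = np.random.default_rng(0).integers(0, 6, size=1000)
small = np.random.default_rng(1).integers(0, 6, size=50)
print(vr_bellman_backup(reward=1.0, next_big=big, next_small=small, v=v, v_ref=v_ref))
```

As we read the abstract, the "double" in OPDVR refers to applying the variance-reduction template in two stages; a backup like the one above would be the inner step of each stage.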