Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Near-Optimal Offline Reinforcement Learning via Double Variance Reduction
Authors: Ming Yin, Yu Bai, Yu-Xiang Wang
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon stationary transition setting... Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$ which certifies that OPDVR is optimal up to logarithmic factors. |
| Researcher Affiliation | Collaboration | Ming Yin (1, 3), Yu Bai (2), and Yu-Xiang Wang (1). (1) Department of Computer Science, UC Santa Barbara; (2) Salesforce Research; (3) Department of Statistics and Applied Probability, UC Santa Barbara |
| Pseudocode | Yes | Algorithm 1 OPVRT: A Prototypical Off-Policy Variance Reduction Template; Algorithm 2 (OPDVR) Off-Policy Doubled Variance Reduction |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper refers to using a 'static offline dataset D' obtained by executing a 'pre-specified behavior policy µ', but does not name a publicly available dataset or provide any access information (link, DOI, specific citation with authors/year) for a dataset used for training. |
| Dataset Splits | No | The paper does not provide specific information regarding training, validation, or test dataset splits. It is a theoretical paper focusing on algorithms and sample complexity. |
| Hardware Specification | No | The paper does not mention any specific hardware used for running experiments. It is a theoretical paper. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. It focuses on theoretical algorithms and proofs. |
| Experiment Setup | No | The paper is theoretical and does not provide details about an experimental setup, such as hyperparameters or specific training configurations. |