Variance-Aware Off-Policy Evaluation with Linear Function Approximation

Authors: Yifei Min, Tianhao Wang, Dongruo Zhou, Quanquan Gu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
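To make the reweighting step concrete, below is a minimal sketch of variance-weighted Fitted Q-Iteration for a time-inhomogeneous episodic linear MDP. The feature map `phi`, the target-policy helper `pi_target`, and the per-transition variance estimates `sigma2` are assumptions introduced for illustration; this is not the authors' implementation of Algorithm 1 (VA-OPE), which additionally estimates the variances from a separate portion of the data.

```python
import numpy as np

def weighted_fqi(data, phi, pi_target, sigma2, H, d, lam=1.0):
    """Variance-weighted Fitted Q-Iteration sketch for OPE in a linear MDP.

    data[h]   : list of (s, a, r, s_next) transitions collected at step h
    phi(s, a) : known d-dimensional feature map of the linear MDP
    pi_target : target policy, mapping a state to an action
    sigma2[h] : plug-in estimates of the value-function variance, one per transition
                (assumed given here; VA-OPE obtains them from a separate regression)
    """
    w = [np.zeros(d) for _ in range(H + 1)]  # Q_h(s, a) ~ phi(s, a)^T w[h]; w[H] stays 0
    for h in reversed(range(H)):             # backward over the horizon
        A = lam * np.eye(d)                  # weighted ridge-regression Gram matrix
        b = np.zeros(d)
        for i, (s, a, r, s_next) in enumerate(data[h]):
            x = phi(s, a)
            v_next = phi(s_next, pi_target(s_next)) @ w[h + 1]  # next-state value under the target policy
            weight = 1.0 / sigma2[h][i]      # reweight the Bellman residual by 1 / sigma^2
            A += weight * np.outer(x, x)
            b += weight * x * (r + v_next)
        w[h] = np.linalg.solve(A, b)
    return w  # value estimate: average phi(s, pi_target(s))^T w[0] over the initial distribution
```

Setting every entry of `sigma2` to 1 recovers ordinary (unweighted) Fitted Q-Iteration, the baseline that the variance-aware reweighting is designed to improve upon.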
Researcher Affiliation | Academia | Yifei Min, Department of Statistics and Data Science, Yale University, CT 06511 (yifei.min@yale.edu); Tianhao Wang, Department of Statistics and Data Science, Yale University, CT 06511 (tianhao.wang@yale.edu); Dongruo Zhou, Department of Computer Science, University of California, Los Angeles, CA 90095 (drzhou@cs.ucla.edu); Quanquan Gu, Department of Computer Science, University of California, Los Angeles, CA 90095 (qgu@cs.ucla.edu)
Pseudocode | Yes | Algorithm 1: Variance-Aware Off-Policy Evaluation (VA-OPE)
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | No | We construct a linear MDP instance as follows. The MDP has |S| = 2 states and |A| = 100 actions, with feature dimension d = 10. The behavior policy chooses action a = 0 with probability p and, with probability 1 − p, an action drawn uniformly from {1, . . . , 99}. The target policy π always chooses a = 0 regardless of the state, making states 0 and 1 absorbing. The parameter p can be used to control the distribution shift between the behavior and target policies. Here p ≈ 0 leads to small distribution shift, and p ≈ 1 leads to large distribution shift. The initial distribution ξ1 is uniform over S. For more details about the construction of the linear MDP and the parameter configuration, please refer to Appendix A.
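For concreteness, the behavior and target policies described in this entry can be sketched as follows; the sampling helpers and the `rng` seed are illustrative choices, and the rest of the linear MDP construction (features, rewards, transitions) is given in Appendix A of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen for illustration
NUM_ACTIONS = 100                # |A| = 100; |S| = 2 and d = 10 in the paper's instance

def behavior_policy(state, p):
    """Play a = 0 with probability p; otherwise an action uniform over {1, ..., 99}."""
    if rng.random() < p:
        return 0
    return int(rng.integers(1, NUM_ACTIONS))

def target_policy(state):
    """The target policy always plays a = 0, regardless of the state."""
    return 0
```

Sweeping p between 0 and 1 traces out the range of distribution shifts studied in the experiments.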
Dataset Splits | No | Assumption 2.5 is standard in the offline RL literature [48, 10]. Note that the assumption involves a data splitting, i.e., one can view the whole dataset D ∪ Ď as being split into two halves. The datasets D and Ď are then used for two different purposes in Algorithm 1, as made clear in the next section. We would like to remark that the only purpose of the splitting is to avoid a lengthy analysis; there is no need to perform the data splitting in practice. Also, in our implementation and experiments, we do not split the data.
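As a minimal illustration of the split assumed in the analysis (and, per the authors, not used in their implementation), the offline transitions can be divided into the two halves D and Ď as follows; the function name and the random permutation are illustrative choices.

```python
import numpy as np

def split_offline_data(transitions, seed=0):
    """Randomly split the offline transitions into two equal halves D and Ď.

    In the analysis, one half serves to estimate the value-function variance and
    the other to run the reweighted regression; the paper notes the split is only
    a device to simplify the proof and is skipped in practice.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    half = len(transitions) // 2
    D = [transitions[i] for i in idx[:half]]
    D_check = [transitions[i] for i in idx[half:]]
    return D, D_check
```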
Hardware Specification | No | The paper does not explicitly provide specific hardware details such as GPU or CPU models, memory specifications, or cloud instance types used for running the experiments. While the ethics checklist states 'Yes' to including the resource type, these details are not found in the main body of the paper.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming language versions, libraries, or frameworks (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | For more details about the construction of the linear MDP and parameter configuration, please refer to Appendix A. ... 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]