Semiparametrically Efficient Off-Policy Evaluation in Linear Markov Decision Processes

Authors: Chuhan Xie, Wenhao Yang, Zhihua Zhang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we implement simulation experiments to demonstrate the efficiency of our estimator and the validity of our proposed inference procedure.
Researcher Affiliation | Academia | (1) School of Mathematical Sciences, Peking University, Beijing, China; (2) Academy of Advanced Interdisciplinary Studies, Peking University, Beijing, China.
Pseudocode | Yes | Algorithm 1: One-Step Estimator
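Algorithm 1 itself is not reproduced in this report. For context, a one-step estimator generically adds the sample mean of an estimated influence function to an initial plug-in estimate; the sketch below shows only that generic pattern, with all function names hypothetical rather than the paper's concrete construction for linear MDPs.

```python
import numpy as np

def one_step_estimate(samples, plug_in_value, influence_fn):
    """Generic one-step (debiased plug-in) correction.

    plug_in_value: initial plug-in estimate of the target value.
    influence_fn:  estimated influence function evaluated at the plug-in
                   nuisances; influence_fn(z) returns a scalar for a sample z.
    Both inputs are placeholders; Algorithm 1 in the paper specifies the
    concrete choices for linear MDPs.
    """
    correction = np.mean([influence_fn(z) for z in samples])
    return plug_in_value + correction
```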
Open Source Code | No | The paper does not provide any links to a code repository or an explicit statement about the open-source availability of its implementation.
Open Datasets | No | We consider a linear MDP with discrete state and action spaces, where |S| = 30, |A| = 10, d = 5 and γ = 0.8. The feature map {ϕ(s, a)}_{s∈S, a∈A} is constructed by drawing i.i.d. Exp(1) random variables for each component of ϕ(s, a) and then normalizing it to satisfy Σ_{i=1}^d ϕ_i(s, a) = 1. The reward parameter ω_0 has its components generated from i.i.d. Unif([0, 1]), and for each s ∈ S, the transition parameter ν_0(s) has its components generated from i.i.d. Exp(1) followed by normalization to satisfy Σ_{s∈S} ν_0(s) = 1_d. The initial state distribution is set as p_{π_b}^{(0)}(s) = 1/30 for all s ∈ S. This indicates a synthetic data generation process rather than the use of a publicly available dataset with concrete access information.
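A minimal NumPy sketch of the synthetic linear-MDP construction quoted above, to make the data-generating process concrete; the random seed and variable names are illustrative, and the reward noise and policies (quoted under Experiment Setup below) are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is illustrative; the paper fixes the draw once

S, A, d = 30, 10, 5             # |S|, |A|, feature dimension
gamma = 0.8

# Feature map: i.i.d. Exp(1) entries, normalized so each phi(s, a) sums to 1.
phi = rng.exponential(1.0, size=(S, A, d))
phi /= phi.sum(axis=-1, keepdims=True)

# Reward parameter: components i.i.d. Unif([0, 1]).
omega0 = rng.uniform(0.0, 1.0, size=d)

# Transition parameter: i.i.d. Exp(1) components for each state,
# normalized across states so that sum_s nu0(s) = 1_d componentwise.
nu0 = rng.exponential(1.0, size=(S, d))
nu0 /= nu0.sum(axis=0, keepdims=True)

# Induced linear-MDP quantities: P(s' | s, a) = phi(s, a)^T nu0(s'),
# mean reward r(s, a) = phi(s, a)^T omega0.
P = phi @ nu0.T                 # shape (S, A, S); each row sums to 1 by construction
r = phi @ omega0                # shape (S, A)

# Initial state distribution under the behavior policy: uniform over the 30 states.
p0 = np.full(S, 1.0 / S)
```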
Dataset Splits | Yes | We construct an estimator without sample splitting (i.e., all samples are used to construct nuisance estimates), a 2-fold sample-splitting estimator, and a 5-fold sample-splitting estimator.
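A generic sketch of the K-fold sample-splitting (cross-fitting) scheme the quote refers to, assuming the nuisance-fitting and evaluation steps are supplied as callables; the no-splitting variant corresponds to fitting the nuisances on all samples. Function and argument names are hypothetical.

```python
import numpy as np

def cross_fit_estimate(n_samples, fit_nuisance, evaluate, n_folds=2, rng=None):
    """K-fold sample splitting: fit nuisances on K-1 folds, evaluate on the held-out fold.

    fit_nuisance(train_idx) -> nuisance estimates fitted on the training folds.
    evaluate(nuisance, eval_idx) -> estimator value computed on the held-out fold.
    Both callables stand in for the paper's nuisance-estimation and one-step steps.
    """
    idx = np.arange(n_samples)
    if rng is not None:
        rng.shuffle(idx)
    folds = np.array_split(idx, n_folds)
    estimates = []
    for k in range(n_folds):
        eval_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        nuisance = fit_nuisance(train_idx)        # nuisances use only the other folds
        estimates.append(evaluate(nuisance, eval_idx))
    return float(np.mean(estimates))              # average the fold-wise estimates
```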
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU/GPU models, memory, cloud instances) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific solver versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | We consider a linear MDP with discrete state and action spaces, where |S| = 30, |A| = 10, d = 5 and γ = 0.8. The feature map {ϕ(s, a)}_{s∈S, a∈A} is constructed by drawing i.i.d. Exp(1) random variables for each component of ϕ(s, a) and then normalizing it to satisfy Σ_{i=1}^d ϕ_i(s, a) = 1. The reward parameter ω_0 has its components generated from i.i.d. Unif([0, 1]), and for each s ∈ S, the transition parameter ν_0(s) has its components generated from i.i.d. Exp(1) followed by normalization to satisfy Σ_{s∈S} ν_0(s) = 1_d. The feature map and true parameters are kept fixed once they are generated. Denoting S = {0, 1, ..., 29} and A = {0, 1, ..., 9}, we set the variance of the reward as Ω(s, a) = 1/100 + (10s + a)/600, and the behavior and target policies are defined as π_b(a | s) = 0.2 if a ≡ s − 1, 0.2 if a ≡ s, 0.6 if a ≡ s + 1, and 0 otherwise, and π_e(a | s) = 0.1 for all s ∈ S, a ∈ A, where ≡ denotes equivalence modulo 10. The initial state distribution is set as p_{π_b}^{(0)}(s) = 1/30 for all s ∈ S. Our aim is to evaluate the value function at s_0 = 0, i.e., v_{π_e} = V_{π_e}(0). In the following, all simulation experiments are repeated 1,000 times, and the number of samples used ranges from 5,000 to 100,000.
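A small NumPy sketch of the reward-variance schedule and the behavior/target policies quoted above, under the stated identification S = {0, ..., 29} and A = {0, ..., 9}; array names are illustrative.

```python
import numpy as np

S, A = 30, 10

# Reward variance: Omega(s, a) = 1/100 + (10 s + a) / 600.
s_grid, a_grid = np.meshgrid(np.arange(S), np.arange(A), indexing="ij")
Omega = 1.0 / 100.0 + (10 * s_grid + a_grid) / 600.0

# Behavior policy: mass 0.2 / 0.2 / 0.6 on the actions congruent to s-1, s, s+1 (mod 10).
pi_b = np.zeros((S, A))
for s in range(S):
    pi_b[s, (s - 1) % A] = 0.2
    pi_b[s, s % A] = 0.2
    pi_b[s, (s + 1) % A] = 0.6

# Target policy: uniform over the 10 actions.
pi_e = np.full((S, A), 1.0 / A)

# Sanity check: both policies are proper conditional distributions over actions.
assert np.allclose(pi_b.sum(axis=1), 1.0) and np.allclose(pi_e.sum(axis=1), 1.0)
```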