Semiparametrically Efficient Off-Policy Evaluation in Linear Markov Decision Processes
Authors: Chuhan Xie, Wenhao Yang, Zhihua Zhang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we implement simulation experiments to demonstrate the efficiency of our estimator and the validity of our proposed inference procedure. |
| Researcher Affiliation | Academia | 1School of Mathematical Sciences, Peking University, Beijing, China 2Academy of Advanced Interdisciplinary Studies, Peking University, Beijing, China. |
| Pseudocode | Yes | Algorithm 1 One-Step Estimator |
| Open Source Code | No | The paper does not provide any links to a code repository or any explicit statement about the open-source availability of its methodology. |
| Open Datasets | No | We consider a linear MDP with discrete state and action spaces, where |S| = 30, |A| = 10, d = 5 and γ = 0.8. The feature map {ϕ(s, a)}_{s∈S, a∈A} is constructed by drawing i.i.d. Exp(1) random variables for each component of ϕ(s, a) and then normalizing it to satisfy Σ_{i=1}^d ϕ_i(s, a) = 1. The reward parameter ω_0 has its components generated from i.i.d. Unif([0, 1]), and for each s ∈ S, the transition parameter ν_0(s) has its components generated from i.i.d. Exp(1) followed by normalization to satisfy Σ_{s∈S} ν_0(s) = 1_d. The initial state distribution is set as p^(0)_{π_b}(s) = 1/30 for all s ∈ S. This indicates a synthetic data-generation process rather than the use of a publicly available dataset with concrete access information. |
| Dataset Splits | Yes | We construct an estimator without sample splitting (i.e., all samples are used to construct nuisance estimates), a 2-fold sample splitting estimator and a 5-fold sample splitting estimator. A cross-fitting sketch of this splitting scheme is given after the table. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU/GPU models, memory, cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific solver versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | We consider a linear MDP with discrete state and action spaces, where |S| = 30, |A| = 10, d = 5 and γ = 0.8. The feature map {ϕ(s, a)}_{s∈S, a∈A} is constructed by drawing i.i.d. Exp(1) random variables for each component of ϕ(s, a) and then normalizing it to satisfy Σ_{i=1}^d ϕ_i(s, a) = 1. The reward parameter ω_0 has its components generated from i.i.d. Unif([0, 1]), and for each s ∈ S, the transition parameter ν_0(s) has its components generated from i.i.d. Exp(1) followed by normalization to satisfy Σ_{s∈S} ν_0(s) = 1_d. The feature map and true parameters are kept fixed once they are generated. Denoting S = {0, 1, ..., 29} and A = {0, 1, ..., 9}, we set the variance of the reward as Ω(s, a) = 1/100 + (10s + a)/600. The behavior policy is π_b(a | s) = 0.2 if a ≡ s − 1, 0.2 if a ≡ s, 0.6 if a ≡ s + 1, and 0 otherwise, while the target policy is π_e(a | s) = 0.1 for all s ∈ S, a ∈ A, where ≡ denotes equivalence modulo 10. The initial state distribution is set as p^(0)_{π_b}(s) = 1/30 for all s ∈ S. Our aim is to evaluate the value function at s_0 = 0, i.e., v_{π_e} = V_{π_e}(0). In the following, all simulation experiments are repeated 1,000 times, and the number of samples used ranges from 5,000 to 100,000. A simulation sketch of this setup is given after the table. |
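The "Experiment Setup" row fully specifies a synthetic linear MDP, so a short simulation sketch can make the data-generating process concrete. The code below is a hedged reconstruction, not the authors' implementation: the function names (`build_linear_mdp`, `behavior_policy`, `target_policy`), the seed, and the use of NumPy are assumptions; the sizes, distributions, normalizations, and policy/variance formulas are taken from the quoted setup.

```python
# Sketch of the synthetic linear-MDP setup described in the "Experiment Setup" row.
# Function names and the seed are assumptions; only the distributions, normalizations,
# and policy definitions come from the paper's description.
import numpy as np

def build_linear_mdp(n_states=30, n_actions=10, d=5, gamma=0.8, seed=0):
    rng = np.random.default_rng(seed)

    # Feature map: i.i.d. Exp(1) entries, each phi(s, a) normalized to sum to 1.
    phi = rng.exponential(1.0, size=(n_states, n_actions, d))
    phi /= phi.sum(axis=-1, keepdims=True)

    # Reward parameter omega_0 with i.i.d. Unif([0, 1]) components.
    omega0 = rng.uniform(0.0, 1.0, size=d)

    # Transition parameter nu_0: for each s, i.i.d. Exp(1) entries, normalized so that
    # sum_s nu_0(s) = 1_d, i.e. each of the d columns sums to 1 over states.
    nu0 = rng.exponential(1.0, size=(n_states, d))
    nu0 /= nu0.sum(axis=0, keepdims=True)

    # Transition kernel P(s' | s, a) = <phi(s, a), nu_0(s')> and reward variance
    # Omega(s, a) = 1/100 + (10 s + a) / 600.
    P = np.einsum("sad,td->sat", phi, nu0)  # shape (s, a, s')
    s_idx, a_idx = np.meshgrid(np.arange(n_states), np.arange(n_actions), indexing="ij")
    reward_var = 1.0 / 100.0 + (10 * s_idx + a_idx) / 600.0

    return phi, omega0, nu0, P, reward_var

def behavior_policy(n_states=30, n_actions=10):
    # pi_b(a | s): 0.2 on a = s-1, 0.2 on a = s, 0.6 on a = s+1 (all modulo 10).
    pi_b = np.zeros((n_states, n_actions))
    for s in range(n_states):
        pi_b[s, (s - 1) % n_actions] += 0.2
        pi_b[s, s % n_actions] += 0.2
        pi_b[s, (s + 1) % n_actions] += 0.6
    return pi_b

def target_policy(n_states=30, n_actions=10):
    # pi_e(a | s) = 0.1 for every state-action pair (uniform over the 10 actions).
    return np.full((n_states, n_actions), 1.0 / n_actions)
```

Under these assumptions, the initial state distribution is simply uniform over the 30 states, and trajectories can be rolled out by sampling actions from `behavior_policy()` and next states from the rows of `P`.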
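The "Dataset Splits" row compares a no-splitting estimator with 2-fold and 5-fold sample-splitting variants. Below is a minimal cross-fitting sketch of the splitting logic only; `fit_nuisance` and `efficient_influence` are hypothetical placeholders for the paper's nuisance estimation and one-step correction (Algorithm 1), which are not reproduced here.

```python
# Minimal cross-fitting sketch for the sample-splitting variants in the table
# (no splitting, 2-fold, 5-fold). Only the splitting logic is shown; the
# nuisance fitting and influence-function evaluation are passed in as callables.
import numpy as np

def cross_fit_estimate(transitions, n_folds, fit_nuisance, efficient_influence):
    """Average an estimated influence function over held-out folds.

    transitions         : np.ndarray of shape (n, ...) holding the logged tuples.
    n_folds             : 1 reproduces the no-splitting estimator; 2 and 5 match the table.
    fit_nuisance        : callable mapping training samples to nuisance estimates.
    efficient_influence : callable returning per-sample values on held-out samples.
    """
    n = len(transitions)
    idx = np.arange(n)
    folds = np.array_split(idx, n_folds)

    values = np.empty(n)
    for k, held_out in enumerate(folds):
        # With a single fold, nuisances are fit on all samples (no splitting);
        # otherwise they are fit on the complement of the held-out fold.
        if n_folds == 1:
            train = idx
        else:
            train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        nuisance = fit_nuisance(transitions[train])
        values[held_out] = efficient_influence(transitions[held_out], nuisance)

    return values.mean()
```

Calling `cross_fit_estimate(data, 1, ...)`, `cross_fit_estimate(data, 2, ...)`, and `cross_fit_estimate(data, 5, ...)` would correspond to the three variants compared in the table.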