Semiparametrically Efficient Off-Policy Evaluation in Linear Markov Decision Processes

Authors: Chuhan Xie, Wenhao Yang, Zhihua Zhang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we implement simulation experiments to demonstrate the efficiency of our estimator and the validity of our proposed inference procedure.
Researcher Affiliation | Academia | (1) School of Mathematical Sciences, Peking University, Beijing, China; (2) Academy of Advanced Interdisciplinary Studies, Peking University, Beijing, China.
Pseudocode | Yes | Algorithm 1: One-Step Estimator
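Algorithm 1 itself is not reproduced in this report. For context, a one-step estimator generically adds the sample mean of an estimated influence function to an initial plug-in estimate; the sketch below shows only that generic pattern, with all function names hypothetical rather than the paper's concrete construction for linear MDPs.

```python
import numpy as np

def one_step_estimate(samples, plug_in_value, influence_fn):
    """Generic one-step (debiased plug-in) correction.

    plug_in_value: initial plug-in estimate of the target value.
    influence_fn:  estimated influence function evaluated at the plug-in
                   nuisances; influence_fn(z) returns a scalar for a sample z.
    Both inputs are placeholders; Algorithm 1 in the paper specifies the
    concrete choices for linear MDPs.
    """
    correction = np.mean([influence_fn(z) for z in samples])
    return plug_in_value + correction
```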
Open Source Code | No | The paper does not provide any links to a code repository or an explicit statement about the open-source availability of its implementation.
Open Datasets | No | We consider a linear MDP with discrete state and action spaces, where |S| = 30, |A| = 10, d = 5 and γ = 0.8. The feature map {ϕ(s, a)}_{s∈S, a∈A} is constructed by drawing i.i.d. Exp(1) random variables for each component of ϕ(s, a) and then normalizing it to satisfy Σ_{i=1}^d ϕ_i(s, a) = 1. The reward parameter ω_0 has its components generated from i.i.d. Unif([0, 1]), and for each s ∈ S, the transition parameter ν_0(s) has its components generated from i.i.d. Exp(1) followed by normalization to satisfy Σ_{s∈S} ν_0(s) = 1_d. The initial state distribution is set as p_{π_b}^{(0)}(s) = 1/30 for all s ∈ S. This indicates a synthetic data generation process rather than the use of a publicly available dataset with concrete access information.
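A minimal NumPy sketch of the synthetic linear-MDP construction quoted above, to make the data-generating process concrete; the random seed and variable names are illustrative, and the reward noise and policies (quoted under Experiment Setup below) are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is illustrative; the paper fixes the draw once

S, A, d = 30, 10, 5             # |S|, |A|, feature dimension
gamma = 0.8

# Feature map: i.i.d. Exp(1) entries, normalized so each phi(s, a) sums to 1.
phi = rng.exponential(1.0, size=(S, A, d))
phi /= phi.sum(axis=-1, keepdims=True)

# Reward parameter: components i.i.d. Unif([0, 1]).
omega0 = rng.uniform(0.0, 1.0, size=d)

# Transition parameter: i.i.d. Exp(1) components for each state,
# normalized across states so that sum_s nu0(s) = 1_d componentwise.
nu0 = rng.exponential(1.0, size=(S, d))
nu0 /= nu0.sum(axis=0, keepdims=True)

# Induced linear-MDP quantities: P(s' | s, a) = phi(s, a)^T nu0(s'),
# mean reward r(s, a) = phi(s, a)^T omega0.
P = phi @ nu0.T                 # shape (S, A, S); each row sums to 1 by construction
r = phi @ omega0                # shape (S, A)

# Initial state distribution under the behavior policy: uniform over the 30 states.
p0 = np.full(S, 1.0 / S)
```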
Dataset Splits | Yes | We construct an estimator without sample splitting (i.e., all samples are used to construct nuisance estimates), a 2-fold sample-splitting estimator, and a 5-fold sample-splitting estimator.
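A generic sketch of the K-fold sample-splitting (cross-fitting) scheme the quote refers to, assuming the nuisance-fitting and evaluation steps are supplied as callables; the no-splitting variant corresponds to fitting the nuisances on all samples. Function and argument names are hypothetical.

```python
import numpy as np

def cross_fit_estimate(n_samples, fit_nuisance, evaluate, n_folds=2, rng=None):
    """K-fold sample splitting: fit nuisances on K-1 folds, evaluate on the held-out fold.

    fit_nuisance(train_idx) -> nuisance estimates fitted on the training folds.
    evaluate(nuisance, eval_idx) -> estimator value computed on the held-out fold.
    Both callables stand in for the paper's nuisance-estimation and one-step steps.
    """
    idx = np.arange(n_samples)
    if rng is not None:
        rng.shuffle(idx)
    folds = np.array_split(idx, n_folds)
    estimates = []
    for k in range(n_folds):
        eval_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        nuisance = fit_nuisance(train_idx)        # nuisances use only the other folds
        estimates.append(evaluate(nuisance, eval_idx))
    return float(np.mean(estimates))              # average the fold-wise estimates
```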
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU/GPU models, memory, cloud instances) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific solver versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | We consider a linear MDP with discrete state and action spaces, where |S| = 30, |A| = 10, d = 5 and γ = 0.8. The feature map {ϕ(s, a)}_{s∈S, a∈A} is constructed by drawing i.i.d. Exp(1) random variables for each component of ϕ(s, a) and then normalizing it to satisfy Σ_{i=1}^d ϕ_i(s, a) = 1. The reward parameter ω_0 has its components generated from i.i.d. Unif([0, 1]), and for each s ∈ S, the transition parameter ν_0(s) has its components generated from i.i.d. Exp(1) followed by normalization to satisfy Σ_{s∈S} ν_0(s) = 1_d. The feature map and true parameters are kept fixed once they are generated. Denoting S = {0, 1, ..., 29} and A = {0, 1, ..., 9}, we set the variance of the reward as Ω(s, a) = 1/100 + (10s + a)/600, and the behavior and target policies are defined as π_b(a | s) = 0.2 if a ≡ s − 1, 0.2 if a ≡ s, 0.6 if a ≡ s + 1, and 0 otherwise, and π_e(a | s) = 0.1 for all s ∈ S, a ∈ A, where ≡ denotes equivalence modulo 10. The initial state distribution is set as p_{π_b}^{(0)}(s) = 1/30 for all s ∈ S. Our aim is to evaluate the value function at s_0 = 0, i.e., v_{π_e} = V_{π_e}(0). In the following, all simulation experiments are repeated 1,000 times, and the number of samples used ranges from 5,000 to 100,000.
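A small NumPy sketch of the reward-variance schedule and the behavior/target policies quoted above, under the stated identification S = {0, ..., 29} and A = {0, ..., 9}; array names are illustrative.

```python
import numpy as np

S, A = 30, 10

# Reward variance: Omega(s, a) = 1/100 + (10 s + a) / 600.
s_grid, a_grid = np.meshgrid(np.arange(S), np.arange(A), indexing="ij")
Omega = 1.0 / 100.0 + (10 * s_grid + a_grid) / 600.0

# Behavior policy: mass 0.2 / 0.2 / 0.6 on the actions congruent to s-1, s, s+1 (mod 10).
pi_b = np.zeros((S, A))
for s in range(S):
    pi_b[s, (s - 1) % A] = 0.2
    pi_b[s, s % A] = 0.2
    pi_b[s, (s + 1) % A] = 0.6

# Target policy: uniform over the 10 actions.
pi_e = np.full((S, A), 1.0 / A)

# Sanity check: both policies are proper conditional distributions over actions.
assert np.allclose(pi_b.sum(axis=1), 1.0) and np.allclose(pi_e.sum(axis=1), 1.0)
```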