Statistically Efficient Off-Policy Policy Gradients

Authors: Nathan Kallus, Masatoshi Uehara

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted an experiment in a simple environment to confirm the theoretical guarantees of the proposed estimator.
Researcher Affiliation | Academia | ¹Cornell University, Ithaca, NY, USA; ²Harvard University, Boston, Massachusetts, USA.
Pseudocode | Yes | Algorithm 1: Efficient Off-Policy Policy Gradient
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets | No | The setting is as follows. Set S_t = R, A_t = R, s_0 = 0. Then, set the transition dynamics as s_t = a_{t-1} - s_{t-1}, the reward as r_t = -s_t^2, the behavior policy as π^b_t(a | s) = N(0.8s, 0.2^2), the policy class as N(θs, 0.2^2), and the horizon as H = 49. Then, θ* = 1 with optimal value J* = -1.96, obtained by analytical calculation. This describes a synthetic environment and data-generation process, not a publicly available dataset with concrete access information. (A simulation sketch of this environment is given below.)
Dataset Splits | No | The paper describes experimental settings and the number of replications, but does not provide specific training/validation/test dataset splits (e.g., percentages, sample counts, or predefined split citations).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions 'Nuisances functions q, µ, d_q, d_µ are estimated by polynomial sieve regressions (Chen, 2007)' but does not provide specific version numbers for any software, libraries, or solvers used. (A polynomial-regression sketch is given below.)
Experiment Setup | Yes | Second, in Fig. 3, we apply gradient ascent as in Algorithm 4 with α_t = 0.15 and T = 40. Nuisances functions q, µ, d_q, d_µ are estimated by polynomial sieve regressions (Chen, 2007). (See the gradient-ascent sketch below.)
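
To make the synthetic setting quoted in the Open Datasets row concrete, the following is a minimal simulation sketch. It assumes the reconstructed dynamics s_t = a_{t-1} - s_{t-1} and reward r_t = -s_t^2; the names (rollout, SIGMA), the seed, and the sample sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative sketch of the synthetic environment described above:
# S = R, A = R, s_0 = 0, s_t = a_{t-1} - s_{t-1}, r_t = -s_t^2, H = 49.
# All names and constants below are assumptions for illustration only.

H = 49        # horizon
SIGMA = 0.2   # standard deviation of the linear-Gaussian policies


def rollout(coef, horizon=H, rng=None):
    """Simulate one trajectory under the policy a ~ N(coef * s, SIGMA^2).

    coef = 0.8 corresponds to the behavior policy; coef = theta gives a
    member of the policy class N(theta * s, SIGMA^2).  Returns the
    cumulative reward of the trajectory.
    """
    rng = np.random.default_rng() if rng is None else rng
    s, total = 0.0, 0.0
    for _ in range(horizon):
        a = rng.normal(coef * s, SIGMA)   # sample the action
        s = a - s                         # transition s_t = a_{t-1} - s_{t-1}
        total += -s ** 2                  # reward r_t = -s_t^2
    return total


rng = np.random.default_rng(0)
j_opt = np.mean([rollout(1.0, rng=rng) for _ in range(2000)])
j_beh = np.mean([rollout(0.8, rng=rng) for _ in range(2000)])
print("Monte-Carlo J(theta = 1):", round(j_opt, 3))  # analytic value: -49 * 0.2^2 = -1.96
print("Monte-Carlo J(behavior):", round(j_beh, 3))
```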
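
The Software Dependencies and Experiment Setup rows note that the nuisance functions q, µ, d_q, d_µ were estimated by polynomial sieve regressions (Chen, 2007). The sketch below only illustrates the generic idea of a polynomial sieve (series) regression on toy one-dimensional data; the helper names, the basis degree, and the toy target are assumptions and do not reproduce the paper's nuisance estimators.

```python
import numpy as np

# Generic polynomial sieve (series) regression: regress y on the basis
# [1, x, x^2, ..., x^degree] by least squares.  The data below are a toy
# stand-in, not the trajectories used in the paper.


def poly_features(x, degree):
    """Polynomial basis expansion for a 1-d input array."""
    return np.vstack([x ** d for d in range(degree + 1)]).T


def sieve_fit(x, y, degree=5):
    """Least-squares fit on the polynomial basis; returns a predictor."""
    beta, *_ = np.linalg.lstsq(poly_features(x, degree), y, rcond=None)
    return lambda x_new: poly_features(x_new, degree) @ beta


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=500)  # toy regression target
q_hat = sieve_fit(x, y)                               # stand-in for a nuisance fit
print(q_hat(np.array([-0.5, 0.0, 0.5])))
```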
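
Finally, a sketch of the outer gradient-ascent loop quoted in the Experiment Setup row (step size α_t = 0.15, T = 40 iterations). To stay self-contained it plugs in a plain on-policy score-function (REINFORCE) gradient estimate computed from simulated rollouts of the environment above; this is only a stand-in for illustration, not the paper's statistically efficient off-policy estimator, which would use the fitted nuisances q, µ, d_q, d_µ instead.

```python
import numpy as np

# Gradient-ascent loop theta <- theta + alpha * grad_hat with alpha = 0.15,
# T = 40, as quoted above.  The gradient estimate is a simple REINFORCE
# (score-function) estimate from on-policy rollouts of the assumed
# environment (s_t = a_{t-1} - s_{t-1}, r_t = -s_t^2), used here only as
# a stand-in for the paper's efficient off-policy gradient estimator.

H, SIGMA, ALPHA, T = 49, 0.2, 0.15, 40


def grad_estimate(theta, n_traj=200, rng=None):
    """Score-function estimate of dJ(theta)/dtheta from n_traj rollouts."""
    rng = np.random.default_rng() if rng is None else rng
    grads = []
    for _ in range(n_traj):
        s, ret, score = 0.0, 0.0, 0.0
        for _ in range(H):
            a = rng.normal(theta * s, SIGMA)
            # d/dtheta of log N(a; theta*s, SIGMA^2) is (a - theta*s) * s / SIGMA^2
            score += (a - theta * s) * s / SIGMA ** 2
            s = a - s                      # transition
            ret += -s ** 2                 # reward
        grads.append(score * ret)          # REINFORCE: (sum of scores) * return
    return float(np.mean(grads))


theta = 0.8                                # illustrative start at the behavior coefficient
rng = np.random.default_rng(1)
for _ in range(T):
    theta += ALPHA * grad_estimate(theta, rng=rng)   # ascent step with alpha_t = 0.15
print("theta after", T, "steps:", round(theta, 3))
# Expect theta to move toward the analytic optimum theta* = 1, up to the
# noise of the crude REINFORCE estimate.
```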