Statistically Efficient Off-Policy Policy Gradients

Authors: Nathan Kallus, Masatoshi Uehara

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted an experiment in a simple environment to confirm the theoretical guarantees of the proposed estimator.
Researcher Affiliation | Academia | ¹Cornell University, Ithaca, NY, USA; ²Harvard University, Boston, Massachusetts, USA.
Pseudocode | Yes | Algorithm 1: Efficient Off-Policy Policy Gradient
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets | No | The setting is as follows. Set S_t = R, A_t = R, s_0 = 0. Then, set the transition dynamics as s_t = a_{t-1} - s_{t-1}, the reward as r_t = -s_t^2, the behavior policy as π^b_t(a | s) = N(0.8s, 0.2^2), the policy class as N(θs, 0.2^2), and the horizon as H = 49. Then, θ* = 1 with optimal value J* = -1.96, obtained by analytical calculation. This describes a synthetic environment and data-generation process, not a publicly available dataset with concrete access information. (A simulation sketch of this environment is given below.)
Dataset Splits | No | The paper describes experimental settings and the number of replications, but does not provide specific training/validation/test dataset splits (e.g., percentages, sample counts, or predefined split citations).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions 'Nuisances functions q, µ, d_q, d_µ are estimated by polynomial sieve regressions (Chen, 2007)' but does not provide specific version numbers for any software, libraries, or solvers used. (A polynomial-regression sketch is given below.)
Experiment Setup | Yes | Second, in Fig. 3, we apply gradient ascent as in Algorithm 4 with α_t = 0.15 and T = 40. Nuisances functions q, µ, d_q, d_µ are estimated by polynomial sieve regressions (Chen, 2007). (See the gradient-ascent sketch below.)
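
To make the synthetic setting quoted in the Open Datasets row concrete, the following is a minimal simulation sketch. It assumes the reconstructed dynamics s_t = a_{t-1} - s_{t-1} and reward r_t = -s_t^2; the names (rollout, SIGMA), the seed, and the sample sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative sketch of the synthetic environment described above:
# S = R, A = R, s_0 = 0, s_t = a_{t-1} - s_{t-1}, r_t = -s_t^2, H = 49.
# All names and constants below are assumptions for illustration only.

H = 49        # horizon
SIGMA = 0.2   # standard deviation of the linear-Gaussian policies


def rollout(coef, horizon=H, rng=None):
    """Simulate one trajectory under the policy a ~ N(coef * s, SIGMA^2).

    coef = 0.8 corresponds to the behavior policy; coef = theta gives a
    member of the policy class N(theta * s, SIGMA^2).  Returns the
    cumulative reward of the trajectory.
    """
    rng = np.random.default_rng() if rng is None else rng
    s, total = 0.0, 0.0
    for _ in range(horizon):
        a = rng.normal(coef * s, SIGMA)   # sample the action
        s = a - s                         # transition s_t = a_{t-1} - s_{t-1}
        total += -s ** 2                  # reward r_t = -s_t^2
    return total


rng = np.random.default_rng(0)
j_opt = np.mean([rollout(1.0, rng=rng) for _ in range(2000)])
j_beh = np.mean([rollout(0.8, rng=rng) for _ in range(2000)])
print("Monte-Carlo J(theta = 1):", round(j_opt, 3))  # analytic value: -49 * 0.2^2 = -1.96
print("Monte-Carlo J(behavior):", round(j_beh, 3))
```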
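
The Software Dependencies and Experiment Setup rows note that the nuisance functions q, µ, d_q, d_µ were estimated by polynomial sieve regressions (Chen, 2007). The sketch below only illustrates the generic idea of a polynomial sieve (series) regression on toy one-dimensional data; the helper names, the basis degree, and the toy target are assumptions and do not reproduce the paper's nuisance estimators.

```python
import numpy as np

# Generic polynomial sieve (series) regression: regress y on the basis
# [1, x, x^2, ..., x^degree] by least squares.  The data below are a toy
# stand-in, not the trajectories used in the paper.


def poly_features(x, degree):
    """Polynomial basis expansion for a 1-d input array."""
    return np.vstack([x ** d for d in range(degree + 1)]).T


def sieve_fit(x, y, degree=5):
    """Least-squares fit on the polynomial basis; returns a predictor."""
    beta, *_ = np.linalg.lstsq(poly_features(x, degree), y, rcond=None)
    return lambda x_new: poly_features(x_new, degree) @ beta


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=500)  # toy regression target
q_hat = sieve_fit(x, y)                               # stand-in for a nuisance fit
print(q_hat(np.array([-0.5, 0.0, 0.5])))
```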
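
Finally, a sketch of the outer gradient-ascent loop quoted in the Experiment Setup row (step size α_t = 0.15, T = 40 iterations). To stay self-contained it plugs in a plain on-policy score-function (REINFORCE) gradient estimate computed from simulated rollouts of the environment above; this is only a stand-in for illustration, not the paper's statistically efficient off-policy estimator, which would use the fitted nuisances q, µ, d_q, d_µ instead.

```python
import numpy as np

# Gradient-ascent loop theta <- theta + alpha * grad_hat with alpha = 0.15,
# T = 40, as quoted above.  The gradient estimate is a simple REINFORCE
# (score-function) estimate from on-policy rollouts of the assumed
# environment (s_t = a_{t-1} - s_{t-1}, r_t = -s_t^2), used here only as
# a stand-in for the paper's efficient off-policy gradient estimator.

H, SIGMA, ALPHA, T = 49, 0.2, 0.15, 40


def grad_estimate(theta, n_traj=200, rng=None):
    """Score-function estimate of dJ(theta)/dtheta from n_traj rollouts."""
    rng = np.random.default_rng() if rng is None else rng
    grads = []
    for _ in range(n_traj):
        s, ret, score = 0.0, 0.0, 0.0
        for _ in range(H):
            a = rng.normal(theta * s, SIGMA)
            # d/dtheta of log N(a; theta*s, SIGMA^2) is (a - theta*s) * s / SIGMA^2
            score += (a - theta * s) * s / SIGMA ** 2
            s = a - s                      # transition
            ret += -s ** 2                 # reward
        grads.append(score * ret)          # REINFORCE: (sum of scores) * return
    return float(np.mean(grads))


theta = 0.8                                # illustrative start at the behavior coefficient
rng = np.random.default_rng(1)
for _ in range(T):
    theta += ALPHA * grad_estimate(theta, rng=rng)   # ascent step with alpha_t = 0.15
print("theta after", T, "steps:", round(theta, 3))
# Expect theta to move toward the analytic optimum theta* = 1, up to the
# noise of the crude REINFORCE estimate.
```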