Statistically Efficient Off-Policy Policy Gradients
Authors: Nathan Kallus, Masatoshi Uehara
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted an experiment in a simple environment to confirm the theoretical guarantees of the proposed estimator. |
| Researcher Affiliation | Academia | ¹Cornell University, Ithaca, NY, USA; ²Harvard University, Massachusetts, Boston, USA. |
| Pseudocode | Yes | Algorithm 1 Efficient Off-Policy Policy Gradient |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | The setting is as follows. Set $\mathcal{S}_t = \mathbb{R}$, $\mathcal{A}_t = \mathbb{R}$, $s_0 = 0$. Then, set the transition dynamics as $s_t = a_{t-1} - s_{t-1}$, the reward as $r_t = -s_t^2$, the behavior policy as $\pi^b_t(a \mid s) = \mathcal{N}(0.8s, 0.2^2)$, the policy class as $\mathcal{N}(\theta s, 0.2^2)$, and horizon as $H = 49$. Then, $\theta^* = 1$ with optimal value $J^* = -1.96$, obtained by analytical calculation. This describes a synthetic environment and data generation process, not a publicly available dataset with concrete access information. |
| Dataset Splits | No | The paper describes experimental settings and number of replications, but does not provide specific training/validation/test dataset splits (e.g., percentages, sample counts, or predefined split citations). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Nuisances functions $q$, $\mu$, $d^q$, $d^\mu$ are estimated by polynomial sieve regressions (Chen, 2007)' but does not provide specific version numbers for any software, libraries, or solvers used. |
| Experiment Setup | Yes | Second, in Fig. 3, we apply a gradient ascent as in Algorithm 4 with $\alpha_t = 0.15$ and $T = 40$. Nuisances functions $q$, $\mu$, $d^q$, $d^\mu$ are estimated by polynomial sieve regressions (Chen, 2007). |
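
Since the paper releases no code, the synthetic environment quoted under Open Datasets can be checked with a short sketch. The version below is not the authors' implementation. It assumes the extracted formulas lost their minus signs, reading the dynamics as s_t = a_{t-1} - s_{t-1} and the reward as r_t = -s_t^2, and it assumes the return sums the rewards at t = 1, ..., H. Under that reading, Var(s_t) = (θ - 1)² Var(s_{t-1}) + 0.2², so θ = 1 gives 49 × (-0.04) = -1.96, which matches the reported optimum in magnitude. The gradient ascent loop uses the exact gradient of this closed form as a stand-in for the paper's estimated off-policy gradient, with the reported step size 0.15 and 40 iterations. All function names and the starting point θ = 0.8 are hypothetical.

```python
import random

SIGMA = 0.2  # policy standard deviation (from the quoted setting)
H = 49       # horizon (from the quoted setting)


def analytic_value(theta, horizon=H, sigma=SIGMA):
    """Exact J(theta) = -sum_{t=1}^{H} Var(s_t) under the assumed signs."""
    var, total = 0.0, 0.0
    for _ in range(horizon):
        # s_t = (theta - 1) * s_{t-1} + sigma * noise, with E[s_t] = 0
        var = (theta - 1.0) ** 2 * var + sigma ** 2
        total -= var
    return total


def analytic_gradient(theta, horizon=H, sigma=SIGMA):
    """Exact dJ/dtheta, obtained by differentiating the variance recursion."""
    var, dvar, grad = 0.0, 0.0, 0.0
    for _ in range(horizon):
        # Differentiate first so that `var` still holds Var(s_{t-1}).
        dvar = 2.0 * (theta - 1.0) * var + (theta - 1.0) ** 2 * dvar
        var = (theta - 1.0) ** 2 * var + sigma ** 2
        grad -= dvar
    return grad


def simulate_value(theta, n_traj=5000, horizon=H, sigma=SIGMA, seed=0):
    """Monte Carlo estimate of J(theta) from on-policy rollouts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traj):
        s = 0.0  # s_0 = 0
        for _step in range(horizon):
            a = rng.gauss(theta * s, sigma)  # a ~ N(theta * s, sigma^2)
            s = a - s                        # assumed transition
            total -= s * s                   # assumed reward r_t = -s_t^2
    return total / n_traj


def gradient_ascent(theta0=0.8, step=0.15, n_iter=40):
    """Plain gradient ascent on the exact J (not the paper's estimator)."""
    theta = theta0
    for _ in range(n_iter):
        theta += step * analytic_gradient(theta)
    return theta


if __name__ == "__main__":
    print("J(1.0) analytic :", analytic_value(1.0))
    print("J(1.0) simulated:", simulate_value(1.0))
    print("J(0.8) analytic :", analytic_value(0.8))
    print("theta after ascent:", gradient_ascent())
```

The closed form also makes it easy to compare the behavior policy (θ = 0.8) with the optimum, and to see that a step size of 0.15 is small enough for the iteration to contract toward θ = 1 in this one-dimensional problem.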