Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Statistically Efficient Off-Policy Policy Gradients
Authors: Nathan Kallus, Masatoshi Uehara
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted an experiment in a simple environment to confirm the theoretical guarantees of the proposed estimator. |
| Researcher Affiliation | Academia | ¹Cornell University, Ithaca, NY, USA ²Harvard University, Massachusetts, Boston, USA. |
| Pseudocode | Yes | Algorithm 1 Efficient Off-Policy Policy Gradient |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | The setting is as follows. Set S_t = R, A_t = R, s_0 = 0. Then, set the transition dynamics as s_t = a_{t-1} - s_{t-1}, the reward as r_t = -s_t^2, the behavior policy as π_t^b(a \| s) = N(0.8s, 0.2^2), the policy class as N(θs, 0.2^2), and the horizon as H = 49. Then, θ* = 1 with optimal value J* = -1.96, obtained by analytical calculation. This describes a synthetic environment and data generation process, not a publicly available dataset with concrete access information. |
| Dataset Splits | No | The paper describes experimental settings and number of replications, but does not provide specific training/validation/test dataset splits (e.g., percentages, sample counts, or predefined split citations). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Nuisances functions q, µ, dq, dµ are estimated by polynomial sieve regressions (Chen, 2007)' but does not provide specific version numbers for any software, libraries, or solvers used. |
| Experiment Setup | Yes | Second, in Fig. 3, we apply a gradient ascent as in Algorithm 4 with α_t = 0.15 and T = 40. Nuisances functions q, µ, dq, dµ are estimated by polynomial sieve regressions (Chen, 2007). |
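The synthetic environment quoted in the Open Datasets row is easy to simulate. The sketch below is illustrative, not the authors' code; it assumes the extraction-garbled formulas read s_t = a_{t-1} - s_{t-1} and r_t = -s_t^2 (dropped minus signs), under which θ* = 1 is optimal: each post-step state is then a fresh N(0, 0.2²) draw, giving J* = 49 × (-0.04) = -1.96. All function and parameter names here are illustrative choices, not from the paper.

```python
import numpy as np

def mc_policy_value(theta, horizon=49, sigma=0.2, n_traj=100_000, seed=0):
    """Monte Carlo estimate of J(theta) for the quoted synthetic MDP:
    s_0 = 0, a_t ~ N(theta * s_t, sigma^2), s_{t+1} = a_t - s_t,
    reward -s_{t+1}^2.  (Dynamics/reward signs are reconstructed
    assumptions; see the lead-in above.)"""
    rng = np.random.default_rng(seed)
    s = np.zeros(n_traj)            # all trajectories start at s_0 = 0
    returns = np.zeros(n_traj)
    for _ in range(horizon):
        a = rng.normal(theta * s, sigma)   # a_t ~ N(theta * s_t, sigma^2)
        s = a - s                          # s_{t+1} = a_t - s_t
        returns += -s ** 2                 # r_{t+1} = -s_{t+1}^2
    return returns.mean()

print(mc_policy_value(1.0))   # close to the analytic optimum -1.96
print(mc_policy_value(0.5))   # strictly worse than theta = 1
```

At θ = 1 the estimate lands near -1.96, matching the analytic value quoted in the table; any other θ inflates the state variance and lowers the return.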
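The gradient-ascent settings quoted in the Experiment Setup row (α_t = 0.15, T = 40) can be sanity-checked on the same synthetic MDP. The sketch below is not the paper's efficient off-policy estimator: assuming the reconstructed dynamics s_{t+1} = a_t - s_t with reward -s_{t+1}^2, it computes the exact value from the second-moment recursion v_{t+1} = (θ - 1)² v_t + σ² and ascends it with a finite-difference gradient, purely to illustrate that these step settings reach θ* = 1. The starting point θ₀ = 0.5 and the finite-difference width are my own choices.

```python
def policy_value(theta, horizon=49, sigma=0.2):
    """Exact J(theta): with s_{t+1} = (theta - 1) * s_t + eps, eps ~ N(0, sigma^2),
    the second moment follows v_{t+1} = (theta - 1)**2 * v_t + sigma**2, and
    J(theta) = -sum_t v_t.  (Reconstructed dynamics; see the lead-in above.)"""
    v, total = 0.0, 0.0
    for _ in range(horizon):
        v = (theta - 1.0) ** 2 * v + sigma ** 2
        total -= v
    return total

def gradient_ascent(theta0=0.5, alpha=0.15, T=40, eps=1e-4):
    """Plain gradient ascent with the quoted step size and iteration count,
    using a central finite-difference gradient of the exact value."""
    theta = theta0
    for _ in range(T):
        grad = (policy_value(theta + eps) - policy_value(theta - eps)) / (2 * eps)
        theta += alpha * grad
    return theta

print(policy_value(1.0))    # 49 steps of -sigma^2 = -0.04 each, i.e. about -1.96
print(gradient_ascent())    # converges to theta = 1
```

With these settings the iterate overshoots slightly on the first step and then contracts geometrically toward θ = 1, so 40 iterations are far more than enough in this deterministic surrogate.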