Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Correcting discount-factor mismatch in on-policy policy gradient methods
Authors: Fengdi Che, Gautham Vasan, A. Rupam Mahmood
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6. Experiments: We test if our estimated correction can avoid the degenerate policy caused by the incorrect gradient under the discount-factor mismatch. |
| Researcher Affiliation | Academia | ¹Department of Computing Science, University of Alberta, Edmonton, Canada; ²CIFAR AI Chair, Amii, Department of Computing Science, University of Alberta, Edmonton, Canada. |
| Pseudocode | Yes | Algorithm 1 BAC with Averaging Correction |
| Open Source Code | Yes | Source code: Averaged PPO and the rest. |
| Open Datasets | Yes | consistently matches or exceeds the original performance on several OpenAI gym and DeepMind suite benchmarks. |
| Dataset Splits | No | The paper refers to "training steps" but does not explicitly state details about validation splits, percentages, or methodology. |
| Hardware Specification | No | We train a UR5 robotic arm on the UR-Reacher-2 task, developed by Mahmood et al. (2018). |
| Software Dependencies | No | The original PPO algorithm follows the implementation by OpenAI spinningup (Achiam, 2018), and the other two algorithms are adjusted based on this implementation. |
| Experiment Setup | Yes | The hyperparameters are shown in Appendix F. |