Correcting discount-factor mismatch in on-policy policy gradient methods

Authors: Fengdi Che, Gautham Vasan, A. Rupam Mahmood

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "6. Experiments: We test if our estimated correction can avoid the degenerate policy caused by the incorrect gradient under the discount-factor mismatch."
Researcher Affiliation | Academia | "1 Department of Computing Science, University of Alberta, Edmonton, Canada. 2 CIFAR AI Chair, Amii, Department of Computing Science, University of Alberta, Edmonton, Canada."
Pseudocode | Yes | "Algorithm 1: BAC with Averaging Correction"
Open Source Code | Yes | "Source code: Averaged PPO and the rest."
Open Datasets | Yes | "consistently matches or exceeds the original performance on several Open AI gym and Deep Mind suite benchmarks."
Dataset Splits | No | The paper refers to "training steps" but does not explicitly state details about validation splits, percentages, or methodology.
Hardware Specification | No | "We train a UR5 robotic arm on the UR-Reacher-2 task, developed by Mahmood et al. (2018)."
Software Dependencies | No | "The original PPO algorithm follows the implementation by Open AI spinningup (Achiam 2018), and the other two algorithms are adjusted based on this implementation."
Experiment Setup | Yes | "The hyperparameters are shown in Appendix F."
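For context, the discount-factor mismatch the paper addresses is the common practice of optimizing the discounted return while dropping the gamma^t weight on the state distribution in the policy-gradient estimate. The sketch below is a minimal REINFORCE-style illustration of that mismatch only; it is not the paper's "BAC with Averaging Correction" algorithm or its Averaged PPO code, and the function names are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Discounted return G_t = sum_{k >= t} gamma^(k - t) * r_k for each step t."""
    G, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def reinforce_weights(rewards, gamma, corrected=True):
    """
    Per-step weights multiplying grad log pi(a_t | s_t) in a REINFORCE update.

    corrected=True : gamma^t * G_t, the gradient of the discounted objective
                     taken from the start-state distribution.
    corrected=False: G_t alone, the weight most deep RL implementations use;
                     dropping the gamma^t factor yields the mismatched
                     (biased) gradient the paper corrects for.
    """
    G = discounted_returns(rewards, gamma)
    t = np.arange(len(rewards))
    return (gamma ** t) * G if corrected else G

# Toy episode: the two weightings diverge as t grows.
rewards = [1.0, 1.0, 1.0, 1.0]
print(reinforce_weights(rewards, gamma=0.9, corrected=True))
print(reinforce_weights(rewards, gamma=0.9, corrected=False))
```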