Correcting discount-factor mismatch in on-policy policy gradient methods

Authors: Fengdi Che, Gautham Vasan, A. Rupam Mahmood

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "6. Experiments: We test if our estimated correction can avoid the degenerate policy caused by the incorrect gradient under the discount-factor mismatch."
Researcher Affiliation | Academia | "1 Department of Computing Science, University of Alberta, Edmonton, Canada. 2 CIFAR AI Chair, Amii, Department of Computing Science, University of Alberta, Edmonton, Canada."
Pseudocode | Yes | "Algorithm 1: BAC with Averaging Correction"
Open Source Code | Yes | "Source code: Averaged PPO and the rest."
Open Datasets | Yes | "consistently matches or exceeds the original performance on several Open AI gym and Deep Mind suite benchmarks."
Dataset Splits | No | The paper refers to "training steps" but does not explicitly state details about validation splits, percentages, or methodology.
Hardware Specification | No | "We train a UR5 robotic arm on the UR-Reacher-2 task, developed by Mahmood et al. (2018)."
Software Dependencies | No | "The original PPO algorithm follows the implementation by Open AI spinningup (Achiam 2018), and the other two algorithms are adjusted based on this implementation."
Experiment Setup | Yes | "The hyperparameters are shown in Appendix F."
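For context, the discount-factor mismatch the paper addresses is the common practice of optimizing the discounted return while dropping the gamma^t weight on the state distribution in the policy-gradient estimate. The sketch below is a minimal REINFORCE-style illustration of that mismatch only; it is not the paper's "BAC with Averaging Correction" algorithm or its Averaged PPO code, and the function names are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Discounted return G_t = sum_{k >= t} gamma^(k - t) * r_k for each step t."""
    G, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def reinforce_weights(rewards, gamma, corrected=True):
    """
    Per-step weights multiplying grad log pi(a_t | s_t) in a REINFORCE update.

    corrected=True : gamma^t * G_t, the gradient of the discounted objective
                     taken from the start-state distribution.
    corrected=False: G_t alone, the weight most deep RL implementations use;
                     dropping the gamma^t factor yields the mismatched
                     (biased) gradient the paper corrects for.
    """
    G = discounted_returns(rewards, gamma)
    t = np.arange(len(rewards))
    return (gamma ** t) * G if corrected else G

# Toy episode: the two weightings diverge as t grows.
rewards = [1.0, 1.0, 1.0, 1.0]
print(reinforce_weights(rewards, gamma=0.9, corrected=True))
print(reinforce_weights(rewards, gamma=0.9, corrected=False))
```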