Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline that has been validated against a manually labeled dataset. Because LLM-based classification introduces uncertainty and potential bias, scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Correcting discount-factor mismatch in on-policy policy gradient methods

Authors: Fengdi Che, Gautham Vasan, A. Rupam Mahmood

ICML 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "6. Experiments: We test if our estimated correction can avoid the degenerate policy caused by the incorrect gradient under the discount-factor mismatch."
Researcher Affiliation | Academia | "1Department of Computing Science, University of Alberta, Edmonton, Canada; 2CIFAR AI Chair, Amii, Department of Computing Science, University of Alberta, Edmonton, Canada."
Pseudocode | Yes | "Algorithm 1: BAC with Averaging Correction"
Open Source Code | Yes | "Source code: Averaged PPO and the rest."
Open Datasets | Yes | "consistently matches or exceeds the original performance on several OpenAI Gym and DeepMind suite benchmarks."
Dataset Splits | No | The paper refers to "training steps" but does not explicitly state details about validation splits, percentages, or methodology.
Hardware Specification | No | "We train a UR5 robotic arm on the UR-Reacher-2 task, developed by Mahmood et al. (2018)."
Software Dependencies | No | "The original PPO algorithm follows the implementation by OpenAI Spinning Up (Achiam, 2018), and the other two algorithms are adjusted based on this implementation."
Experiment Setup | Yes | "The hyperparameters are shown in Appendix F."