Correcting discount-factor mismatch in on-policy policy gradient methods
Authors: Fengdi Che, Gautham Vasan, A. Rupam Mahmood
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Experiments): We test if our estimated correction can avoid the degenerate policy caused by the incorrect gradient under the discount-factor mismatch. (A minimal sketch of this mismatch follows the table.) |
| Researcher Affiliation | Academia | (1) Department of Computing Science, University of Alberta, Edmonton, Canada; (2) CIFAR AI Chair, Amii, Department of Computing Science, University of Alberta, Edmonton, Canada. |
| Pseudocode | Yes | Algorithm 1 BAC with Averaging Correction |
| Open Source Code | Yes | Source code: Averaged PPO and the rest. |
| Open Datasets | Yes | consistently matches or exceeds the original performance on several OpenAI Gym and DeepMind suite benchmarks. |
| Dataset Splits | No | The paper refers to "training steps" but does not explicitly state details about validation splits, percentages, or methodology. |
| Hardware Specification | No | We train a UR5 robotic arm on the UR-Reacher-2 task, developed by Mahmood et al. (2018). |
| Software Dependencies | No | The original PPO algorithm follows the implementation by OpenAI Spinning Up (Achiam, 2018), and the other two algorithms are adjusted based on this implementation. |
| Experiment Setup | Yes | The hyperparameters are shown in Appendix F. |
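The quoted experiments line (Research Type row) refers to the discount-factor mismatch the paper corrects: the gradient of the discounted objective weights the policy-gradient term at step t by γ^t, whereas common on-policy implementations, including the Spinning Up PPO cited above, drop that weight. The sketch below illustrates only this textbook mismatch under assumed REINFORCE-style definitions; the function name and interface are hypothetical, and it does not implement the paper's averaging correction, which replaces the raw γ^t weight with an estimated correction.

```python
import numpy as np

def reinforce_terms(log_prob_grads, rewards, gamma=0.99, discount_states=True):
    """Per-timestep REINFORCE gradient terms for one episode (illustrative only).

    log_prob_grads: list of arrays, each grad log pi(a_t | s_t).
    rewards: list of scalar rewards r_t.
    discount_states=True weights the term at step t by gamma**t, as the gradient
    of the discounted objective requires; False drops that weight, which is the
    common implementation and the source of the discount-factor mismatch.
    """
    T = len(rewards)
    # Discounted return-to-go: G_t = sum_{k >= t} gamma^(k - t) * r_k.
    returns = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g
        returns[t] = g
    terms = []
    for t in range(T):
        state_weight = gamma ** t if discount_states else 1.0
        terms.append(state_weight * returns[t] * log_prob_grads[t])
    return terms
```

Summing these terms over an episode (and averaging over episodes) gives the two gradient estimates whose difference the paper analyses; the γ^t-weighted version is unbiased for the discounted objective but its weights vanish as t grows, which is one reason raw discounting is rarely used in practice.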