Episodic Policy Gradient Training

Authors: Hung Le, Majid Abdolshah, Thommen K. George, Kien Do, Dung Nguyen, Svetha Venkatesh

AAAI 2022, pp. 7317-7325 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on both continuous and discrete environments demonstrate the advantage of using the proposed method in boosting the performance of various policy gradient algorithms.
Researcher Affiliation | Academia | Applied AI Institute, Deakin University, Geelong, Australia. {thai.le, m.abdolshah, thommen.karimpanalgeorge, k.do, dung.nguyen, svetha.venkatesh}@deakin.edu.au
Pseudocode | Yes | Algorithm 1: Episodic Policy Gradient Training.
Open Source Code | Yes | Our code can be found at https://github.com/thaihungle/EPGT
Open Datasets | Yes | We test on 2 environments: Mountain Car Continuous (MCC) and Bipedal Walker (BW)... We conduct experiments on 6 Mujoco environments: Half Cheetah, Hopper, Walker2d, Swimmer, Ant and Humanoid... We adopt 6 standard Atari games...
Dataset Splits | No | The paper describes training durations (e.g., 'train agents for 5 million steps') and evaluation metrics, but does not explicitly provide static training, validation, or test splits. This is common in reinforcement learning, where data is generated through interaction with an environment rather than drawn from a fixed dataset.
Hardware Specification | No | The paper does not provide hardware details such as GPU/CPU models, memory, or the type of computing infrastructure used to run the experiments.
Software Dependencies | No | The paper mentions several policy gradient methods (A2C, ACKTR, PPO) and implies an implementation in Python, but it does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | The experimental details can be found in Appendix B. We test on 2 environments: Mountain Car Continuous (MCC) and Bipedal Walker (BW) with long and short learning rate search ranges ([4×10^-5, 10^-2] and [2.8×10^-4, 1.8×10^-3], respectively). We train agents for 5 million steps and report the mean (and std. if applicable) over 10 runs.
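The Experiment Setup row above quotes two learning-rate search ranges, a 5-million-step training budget, and 10 runs per configuration. The sketch below only illustrates those reported numbers; the log-uniform sampling, the Gym environment IDs, and the mapping of ranges to environments are assumptions made for illustration and are not the paper's EPGT procedure (Algorithm 1).

```python
import numpy as np

# Minimal sketch of the reported search spaces, NOT the authors' EPGT algorithm:
# the report quotes a "long" and a "short" learning-rate search range; drawing
# candidates log-uniformly from them is an assumption for illustration only.
LONG_RANGE = (4e-5, 1e-2)       # "long" learning-rate search range (quoted)
SHORT_RANGE = (2.8e-4, 1.8e-3)  # "short" learning-rate search range (quoted)

# Standard Gym environment IDs assumed to correspond to the environments named
# in the paper (hypothetical mapping, for illustration).
ENV_IDS = ["MountainCarContinuous-v0", "BipedalWalker-v3"]

TOTAL_STEPS = 5_000_000  # "train agents for 5 million steps"
NUM_RUNS = 10            # "report the mean (and std. if applicable) over 10 runs"


def sample_learning_rate(low: float, high: float, rng: np.random.Generator) -> float:
    """Draw one candidate learning rate log-uniformly from [low, high]."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for env_id, (low, high) in zip(ENV_IDS, [LONG_RANGE, SHORT_RANGE]):
        lrs = [sample_learning_rate(low, high, rng) for _ in range(NUM_RUNS)]
        print(f"{env_id}: {NUM_RUNS} candidate learning rates in [{low:g}, {high:g}]")
        print("  ", [f"{lr:.2e}" for lr in lrs])
```

Log-uniform sampling is a customary choice for learning-rate sweeps because the ranges span orders of magnitude; how the authors actually select learning rates within these ranges is described in the paper and its Appendix B, not here.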