Supervised Policy Update for Deep Reinforcement Learning

Authors: Quan Vuong, Yiming Zhang, Keith W. Ross

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks."
Researcher Affiliation | Academia | Quan Vuong, University of California, San Diego (qvuong@ucsd.edu); Yiming Zhang, New York University (yiming.zhang@cs.nyu.edu); Keith Ross, New York University / New York University Shanghai (keithwross@nyu.edu)
Pseudocode | Yes | "Algorithm 1 Algorithmic description of forward-KL non-parameterized SPU" (an illustrative sketch of the supervised-fitting idea follows this table)
Open Source Code | Yes | "Code for the Mujoco experiments is at https://github.com/quanvuong/Supervised_Policy_Update."
Open Datasets | Yes | "The Mujoco (Todorov et al., 2012) simulated robotics environments provided by Open AI gym (Brockman et al., 2016) have become a popular benchmark for control problems with continuous action spaces. ... on the Arcade Learning Environments (Bellemare et al., 2012) exposed through Open AI gym (Brockman et al., 2016)." (see the environment-loading sketch below)
Dataset Splits | No | The paper reports training duration in timesteps and evaluates performance over episodes, but it does not specify explicit training, validation, or test splits as percentages or sample counts.
Hardware Specification | No | The paper acknowledges "the extremely helpful support by the NYU Shanghai High Performance Computing Administrator Zhiguo Qi", implying the use of HPC resources, but it does not specify particular hardware components such as CPU models, GPU models, or memory.
Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2014)" for gradient descent and references "Open AI baselines (Dhariwal et al., 2017), commit 3cc7df060800a45890908045b79821a13c4babdb" for baselines. While the commit ID is specific, the paper does not list version numbers for the programming language (e.g., Python), the deep learning framework (e.g., TensorFlow, PyTorch), or other key libraries used in the experiments.
Experiment Setup | Yes | "For Mujoco environments, ... Gradient descent is performed using Adam (Kingma & Ba, 2014) with step size 0.0003, minibatch size of 64. ... γ and λ for GAE (Schulman et al., 2015b) are set to 0.99 and 0.95 respectively. For SPU, δ, ϵ, λ and the maximum number of epochs per iteration are set to 0.05/1.2, 0.05, 1.3 and 30 respectively. Training is performed for 1 million timesteps for both SPU and PPO. In the sensitivity analysis, the ranges of values for the hyper-parameters δ, ϵ, λ and maximum number of epochs are [0.05, 0.07], [0.01, 0.07], [1.0, 1.2] and [5, 30] respectively." (a configuration sketch follows this table)
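
The Pseudocode row above points to the paper's Algorithm 1; consult the paper for the exact procedure. Purely as an illustration of the general "supervised policy update" pattern the title refers to (form a fixed, non-parameterized target policy, then fit the parameterized policy to it by gradient descent on a KL objective), here is a minimal PyTorch sketch for a discrete-action softmax policy. The exponentiated-advantage target, the function name, and all arguments are assumptions made for illustration, not the paper's formulas, and the paper's δ and ϵ trust-region constraints are not modeled here.

```python
import torch
import torch.nn.functional as F

def supervised_fit_step(policy_net, optimizer, states, old_probs, advantages, lam=1.3):
    """One illustrative supervised step toward a fixed, non-parameterized target policy."""
    # Hypothetical non-parameterized target (assumption, not the paper's rule):
    # reweight the old policy by exponentiated advantages and renormalize.
    with torch.no_grad():
        target = old_probs * torch.exp(advantages / lam)
        target = target / target.sum(dim=-1, keepdim=True)

    # Supervised objective: forward KL(target || pi_theta), minimized over theta.
    log_pi = F.log_softmax(policy_net(states), dim=-1)
    loss = (target * (torch.log(target.clamp_min(1e-8)) - log_pi)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```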
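The Open Datasets row names the Mujoco and Atari benchmarks exposed through OpenAI Gym. A minimal sketch of loading such environments is below; the specific environment IDs and the pre-v0.26 Gym reset/step API are assumptions, since the paper excerpt does not list them.

```python
import gym

# Assumed environment IDs; the paper excerpt does not name specific task versions.
mujoco_env = gym.make("HalfCheetah-v2")         # Mujoco continuous-control task
atari_env = gym.make("BreakoutNoFrameskip-v4")  # Atari task via the Arcade Learning Environment

obs = mujoco_env.reset()
for _ in range(5):
    action = mujoco_env.action_space.sample()
    obs, reward, done, info = mujoco_env.step(action)
    if done:
        obs = mujoco_env.reset()
```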
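For reference, the Mujoco hyper-parameters quoted in the Experiment Setup row are restated below as a plain Python dict, together with a standard GAE(γ, λ) advantage computation (Schulman et al., 2015b), which the setup says is used with γ = 0.99 and λ = 0.95. The dict keys and the helper function are illustrative naming choices; only the numeric values come from the paper's stated setup, and the helper omits episode-boundary masking for brevity.

```python
import numpy as np

# Values taken from the paper's reported Mujoco setup; key names are assumptions.
MUJOCO_CONFIG = {
    "optimizer": "Adam",
    "step_size": 3e-4,
    "minibatch_size": 64,
    "gae_gamma": 0.99,
    "gae_lambda": 0.95,
    "spu_delta": 0.05 / 1.2,
    "spu_epsilon": 0.05,
    "spu_lambda": 1.3,
    "max_epochs_per_iteration": 30,
    "total_timesteps": 1_000_000,
}

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one non-terminating trajectory segment."""
    values = np.append(values, last_value)  # bootstrap with the value of the final state
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                         # discounted sum of residuals
        advantages[t] = gae
    return advantages
```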