Supervised Policy Update for Deep Reinforcement Learning

Authors: Quan Vuong, Yiming Zhang, Keith W. Ross

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks."
Researcher Affiliation | Academia | Quan Vuong, University of California, San Diego (qvuong@ucsd.edu); Yiming Zhang, New York University (yiming.zhang@cs.nyu.edu); Keith Ross, New York University / New York University Shanghai (keithwross@nyu.edu)
Pseudocode | Yes | "Algorithm 1 Algorithmic description of forward-KL non-parameterized SPU" (an illustrative sketch of the supervised-fitting idea follows this table)
Open Source Code | Yes | "Code for the Mujoco experiments is at https://github.com/quanvuong/Supervised_Policy_Update."
Open Datasets | Yes | "The Mujoco (Todorov et al., 2012) simulated robotics environments provided by Open AI gym (Brockman et al., 2016) have become a popular benchmark for control problems with continuous action spaces. ... on the Arcade Learning Environments (Bellemare et al., 2012) exposed through Open AI gym (Brockman et al., 2016)." (see the environment-loading sketch below)
Dataset Splits | No | The paper reports training duration in timesteps and evaluates performance over episodes, but it does not specify explicit training, validation, or test splits as percentages or sample counts.
Hardware Specification | No | The paper acknowledges "the extremely helpful support by the NYU Shanghai High Performance Computing Administrator Zhiguo Qi", implying the use of HPC resources, but it does not specify particular hardware components such as CPU models, GPU models, or memory.
Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2014)" for gradient descent and references "Open AI baselines (Dhariwal et al., 2017), commit 3cc7df060800a45890908045b79821a13c4babdb" for baselines. While the commit ID is specific, the paper does not list version numbers for the programming language (e.g., Python), the deep learning framework (e.g., TensorFlow, PyTorch), or other key libraries used in the experiments.
Experiment Setup | Yes | "For Mujoco environments, ... Gradient descent is performed using Adam (Kingma & Ba, 2014) with step size 0.0003, minibatch size of 64. ... γ and λ for GAE (Schulman et al., 2015b) are set to 0.99 and 0.95 respectively. For SPU, δ, ϵ, λ and the maximum number of epochs per iteration are set to 0.05/1.2, 0.05, 1.3 and 30 respectively. Training is performed for 1 million timesteps for both SPU and PPO. In the sensitivity analysis, the ranges of values for the hyper-parameters δ, ϵ, λ and maximum number of epochs are [0.05, 0.07], [0.01, 0.07], [1.0, 1.2] and [5, 30] respectively." (a configuration sketch follows this table)
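
The Pseudocode row above points to the paper's Algorithm 1; consult the paper for the exact procedure. Purely as an illustration of the general "supervised policy update" pattern the title refers to (form a fixed, non-parameterized target policy, then fit the parameterized policy to it by gradient descent on a KL objective), here is a minimal PyTorch sketch for a discrete-action softmax policy. The exponentiated-advantage target, the function name, and all arguments are assumptions made for illustration, not the paper's formulas, and the paper's δ and ϵ trust-region constraints are not modeled here.

```python
import torch
import torch.nn.functional as F

def supervised_fit_step(policy_net, optimizer, states, old_probs, advantages, lam=1.3):
    """One illustrative supervised step toward a fixed, non-parameterized target policy."""
    # Hypothetical non-parameterized target (assumption, not the paper's rule):
    # reweight the old policy by exponentiated advantages and renormalize.
    with torch.no_grad():
        target = old_probs * torch.exp(advantages / lam)
        target = target / target.sum(dim=-1, keepdim=True)

    # Supervised objective: forward KL(target || pi_theta), minimized over theta.
    log_pi = F.log_softmax(policy_net(states), dim=-1)
    loss = (target * (torch.log(target.clamp_min(1e-8)) - log_pi)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```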
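The Open Datasets row names the Mujoco and Atari benchmarks exposed through OpenAI Gym. A minimal sketch of loading such environments is below; the specific environment IDs and the pre-v0.26 Gym reset/step API are assumptions, since the paper excerpt does not list them.

```python
import gym

# Assumed environment IDs; the paper excerpt does not name specific task versions.
mujoco_env = gym.make("HalfCheetah-v2")         # Mujoco continuous-control task
atari_env = gym.make("BreakoutNoFrameskip-v4")  # Atari task via the Arcade Learning Environment

obs = mujoco_env.reset()
for _ in range(5):
    action = mujoco_env.action_space.sample()
    obs, reward, done, info = mujoco_env.step(action)
    if done:
        obs = mujoco_env.reset()
```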
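For reference, the Mujoco hyper-parameters quoted in the Experiment Setup row are restated below as a plain Python dict, together with a standard GAE(γ, λ) advantage computation (Schulman et al., 2015b), which the setup says is used with γ = 0.99 and λ = 0.95. The dict keys and the helper function are illustrative naming choices; only the numeric values come from the paper's stated setup, and the helper omits episode-boundary masking for brevity.

```python
import numpy as np

# Values taken from the paper's reported Mujoco setup; key names are assumptions.
MUJOCO_CONFIG = {
    "optimizer": "Adam",
    "step_size": 3e-4,
    "minibatch_size": 64,
    "gae_gamma": 0.99,
    "gae_lambda": 0.95,
    "spu_delta": 0.05 / 1.2,
    "spu_epsilon": 0.05,
    "spu_lambda": 1.3,
    "max_epochs_per_iteration": 30,
    "total_timesteps": 1_000_000,
}

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one non-terminating trajectory segment."""
    values = np.append(values, last_value)  # bootstrap with the value of the final state
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                         # discounted sum of residuals
        advantages[t] = gae
    return advantages
```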