Supervised Policy Update for Deep Reinforcement Learning
Authors: Quan Vuong, Yiming Zhang, Keith W. Ross
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks. |
| Researcher Affiliation | Academia | Quan Vuong, University of California, San Diego (qvuong@ucsd.edu); Yiming Zhang, New York University (yiming.zhang@cs.nyu.edu); Keith Ross, New York University / New York University Shanghai (keithwross@nyu.edu) |
| Pseudocode | Yes | Algorithm 1: Algorithmic description of forward-KL non-parameterized SPU (a rough sketch of this update loop appears below the table). |
| Open Source Code | Yes | Code for the Mujoco experiments is at https://github.com/quanvuong/Supervised_Policy_Update. |
| Open Datasets | Yes | The Mujoco (Todorov et al., 2012) simulated robotics environments provided by Open AI gym (Brockman et al., 2016) have become a popular benchmark for control problems with continuous action spaces. ... on the Arcade Learning Environments (Bellemare et al., 2012) exposed through Open AI gym (Brockman et al., 2016). (A minimal Gym usage sketch appears below the table.) |
| Dataset Splits | No | The paper discusses training duration in timesteps and evaluates performance based on episodes, but it does not specify explicit training, validation, or test dataset splits in terms of percentages or sample counts. |
| Hardware Specification | No | The paper acknowledges 'the extremely helpful support by the NYU Shanghai High Performance Computing Administrator Zhiguo Qi', implying the use of HPC resources. However, it does not specify any particular hardware components like CPU models, GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' for gradient descent and references 'Open AI baselines (Dhariwal et al., 2017), commit 3cc7df060800a45890908045b79821a13c4babdb' for baselines. While a commit ID is specific, the paper does not list version numbers for general programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other key software libraries used in the experiments. |
| Experiment Setup | Yes | For Mujoco environments, ... Gradient descent is performed using Adam (Kingma & Ba, 2014) with step size 0.0003, minibatch size of 64. ... γ and λ for GAE (Schulman et al., 2015b) are set to 0.99 and 0.95 respectively. For SPU, δ, ϵ, λ and the maximum number of epochs per iteration are set to 0.05/1.2, 0.05, 1.3 and 30 respectively. Training is performed for 1 million timesteps for both SPU and PPO. In the sensitivity analysis, the ranges of values for the hyper-parameters δ, ϵ, λ and maximum number of epochs are [0.05, 0.07], [0.01, 0.07], [1.0, 1.2] and [5, 30] respectively. |
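The MuJoCo and Atari benchmarks cited in the datasets row are both exposed through the OpenAI Gym API of that era. Below is a minimal sketch of instantiating and stepping such an environment; the specific environment IDs (HalfCheetah-v2 and BreakoutNoFrameskip-v4) are illustrative picks rather than ones this report pins down, and the pre-0.26 Gym `reset()`/`step()` return convention is assumed.

```python
import gym

# Illustrative environment IDs: one MuJoCo control task, one Atari (ALE) game.
for env_id in ["HalfCheetah-v2", "BreakoutNoFrameskip-v4"]:
    env = gym.make(env_id)
    obs = env.reset()
    episode_return, done = 0.0, False
    while not done:
        action = env.action_space.sample()          # random policy, just to exercise the API
        obs, reward, done, info = env.step(action)
        episode_return += reward
    print(env_id, "random-policy return:", episode_return)
    env.close()
```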
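The setup row fixes γ = 0.99 and λ = 0.95 for GAE (Schulman et al., 2015b). For reference, here is a minimal NumPy sketch of the standard GAE-λ backward recursion over one finished trajectory; the function and variable names are mine, not the paper's.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE-lambda recursion, computed backwards in time:
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = delta_t + gamma * lam * A_{t+1}
    `values` carries one extra bootstrap entry for the state after the last step
    (zero if the trajectory ended in a terminal state)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Value-function regression targets are then advantages + values[:-1].
```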
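Reading the pseudocode row (Algorithm 1, forward-KL non-parameterized SPU) together with the hyper-parameters in the setup row, the policy update can be pictured roughly as follows. This is only a hedged PyTorch-style sketch under assumptions: the policy is taken to output log-probabilities over a discrete action set (the paper's MuJoCo experiments actually use Gaussian policies), and the per-state objective below, an advantage-weighted log-likelihood penalized by the forward KL to the old policy and masked at ε, merely approximates the exact loss defined in the paper's Algorithm 1. What it does carry over are the reported settings: Adam with step size 0.0003, minibatches of 64, δ = 0.05/1.2, ε = 0.05, λ = 1.3, at most 30 epochs per iteration, and early stopping of the epoch loop once the average KL to the old policy exceeds δ.

```python
import torch

def spu_update(policy, old_policy, obs, actions, advantages,
               delta=0.05 / 1.2, eps=0.05, lam=1.3, max_epochs=30,
               minibatch_size=64, lr=3e-4):
    """Hypothetical SPU-style supervised policy update (a sketch, not the paper's exact loss).
    `policy` and `old_policy` map a batch of observations to log-probabilities of shape [B, A]."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    n = obs.shape[0]
    for _ in range(max_epochs):
        perm = torch.randperm(n)
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            logp = policy(obs[idx])
            with torch.no_grad():
                logp_old = old_policy(obs[idx])
            # Per-state forward KL(old || new) and log-likelihood of the taken actions.
            kl = (logp_old.exp() * (logp_old - logp)).sum(-1)
            logp_a = logp.gather(1, actions[idx].unsqueeze(1)).squeeze(1)
            # States whose KL already exceeds eps keep only the KL gradient
            # (a rough stand-in for the paper's per-state clipping).
            mask = (kl <= eps).float()
            loss = (kl - (1.0 / lam) * mask * advantages[idx] * logp_a).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Early stopping: leave the epoch loop once the aggregate KL constraint delta is violated.
        with torch.no_grad():
            logp_old_all, logp_all = old_policy(obs), policy(obs)
            mean_kl = (logp_old_all.exp() * (logp_old_all - logp_all)).sum(-1).mean()
        if mean_kl > delta:
            break
```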