Model-Ensemble Trust-Region Policy Optimization

Authors: Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We design the experiments to answer the following questions: 1. How does our approach compare against state-of-the-art methods in terms of sample complexity and final performance? 2. What are the failure scenarios of the vanilla algorithm? 3. How does our method overcome these failures?" and "The results are shown in Figure 2. Prior model-based methods appear to achieve worse performance compared with model-free methods. In addition, we find that model-based methods tend to be difficult to train over long horizons. In particular, SVG(1), not presented in the plots, is very unstable in our experiments."
Researcher Affiliation | Academia | "Berkeley AI Research, University of California, Berkeley, Berkeley, CA 94709, {thanard.kurutach, iclavera, rockyduan, avivt, pabbeel}@berkeley.edu"
Pseudocode | Yes | "Algorithm 1 Vanilla Model-Based Deep Reinforcement Learning" and "Algorithm 2 Model Ensemble Trust Region Policy Optimization (ME-TRPO)" (a hedged sketch of the ME-TRPO loop appears below the table)
Open Source Code | No | The paper only provides a link ("Videos available at: https://sites.google.com/view/me-trpo") and does not explicitly state that the source code for the methodology is available.
Open Datasets | Yes | "We evaluate our method and various baselines over six standard continuous control benchmark tasks (Duan et al., 2016; Hesse et al., 2017) in MuJoCo (Todorov et al., 2012): Swimmer, Snake, Hopper, Ant, Half Cheetah, and Humanoid, shown in Figure 1. The details of the tasks can be found in Appendix A.2." and "The environments we use are adopted from rllab (Duan et al., 2016)."
Dataset Splits | Yes | "Finally, we split the collected data using a 2-to-1 ratio for training and validation datasets." (see the split helper sketched below the table)
Hardware Specification | Yes | "These experiments were performed on Amazon EC2 using 1 NVIDIA K80 GPU, 4 vCPUs, and 61 GB of memory."
Software Dependencies | No | "We use the Adam optimizer (Kingma and Ba, 2014) to solve this supervised learning problem." and "We evaluated using Vanilla Policy Gradient (VPG) (Peters and Schaal, 2006), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015)." and "in MuJoCo (Todorov et al., 2012)" and "The environments we use are adopted from rllab (Duan et al., 2016)."
Experiment Setup | Yes | "We represent the dynamics model with a 2-hidden-layer feed-forward neural network with hidden sizes 1024-1024 and ReLU nonlinearities. We train the model with the Adam optimizer with learning rate 0.001 using a batch size of 1000." and "We represent the policy with a 2-hidden-layer feed-forward neural network with hidden sizes 32-32 and tanh nonlinearities for all the environments, except Humanoid, in which we use the hidden sizes 100-50-25. The policy is trained with TRPO on the learned models using initial standard deviation 1.0, step size δKL = 0.01, and batch size 50000." (see the network sketch below the table)
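
To make the Pseudocode entry concrete, here is a minimal Python sketch of the ME-TRPO outer loop described in Algorithm 2. The helpers `collect_rollouts`, `train_model`, `trpo_update`, and `policy_improves_on`, the `predict` method on each model, and the 0.7 threshold are assumptions standing in for the paper's data collection, supervised model fitting, TRPO step, and validation check; only the overall loop structure follows the paper.

```python
import random


def me_trpo(env, policy, models,
            collect_rollouts, train_model, trpo_update, policy_improves_on,
            n_outer_iters=100, n_policy_iters=50, improve_threshold=0.7):
    """Hedged sketch of the ME-TRPO outer loop (not the authors' code)."""
    dataset = []
    for _ in range(n_outer_iters):
        # 1. Gather real-environment transitions with the current policy.
        dataset += collect_rollouts(env, policy)

        # 2. Fit every dynamics model in the ensemble on the aggregated data
        #    (each model uses its own shuffled train/validation split).
        for model in models:
            train_model(model, dataset)

        # 3. Improve the policy on imagined rollouts, sampling a model from
        #    the ensemble at every simulated step.
        def step_model(state, action):
            # Assumed `predict(state, action) -> next_state` interface.
            return random.choice(models).predict(state, action)

        for _ in range(n_policy_iters):
            trpo_update(policy, step_model)
            # Early stopping: keep optimizing only while the policy still
            # improves on a large enough fraction of the ensemble
            # (threshold value is an assumption here).
            improved = sum(policy_improves_on(policy, m) for m in models)
            if improved / len(models) < improve_threshold:
                break
    return policy
```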
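
The 2-to-1 split quoted in the Dataset Splits row could be implemented along these lines; `split_two_to_one` is an illustrative helper, not the authors' code.

```python
import numpy as np


def split_two_to_one(transitions, seed=0):
    """Shuffle collected transitions and split them 2:1 into train/validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    cut = (2 * len(transitions)) // 3  # two thirds for training
    train = [transitions[i] for i in idx[:cut]]
    valid = [transitions[i] for i in idx[cut:]]
    return train, valid
```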
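
Finally, a hedged sketch of the network sizes quoted in the Experiment Setup row, written in PyTorch for concreteness (the original implementation builds on rllab/TensorFlow). The observation and action dimensions are placeholders, and the TRPO update itself is not shown.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # placeholder dimensions; task-dependent

# Dynamics model: 2 hidden layers of 1024 units, ReLU, trained with Adam(lr=1e-3).
dynamics_model = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, obs_dim),  # predicts the next state (or a state delta)
)
model_optimizer = torch.optim.Adam(dynamics_model.parameters(), lr=1e-3)

# Policy mean network: 2 hidden layers of 32 units with tanh (100-50-25 for
# Humanoid). The policy is Gaussian with initial standard deviation 1.0 and is
# updated with TRPO (step size delta_KL = 0.01, batch size 50000), not shown.
policy_mean = nn.Sequential(
    nn.Linear(obs_dim, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, act_dim),
)
log_std = nn.Parameter(torch.zeros(act_dim))  # exp(0) = 1.0 initial std
```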