Model-Ensemble Trust-Region Policy Optimization
Authors: Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We design the experiments to answer the following questions: 1. How does our approach compare against state-of-the-art methods in terms of sample complexity and final performance? 2. What are the failure scenarios of the vanilla algorithm? 3. How does our method overcome these failures? and The results are shown in Figure 2. Prior model-based methods appear to achieve worse performance compared with model-free methods. In addition, we find that model-based methods tend to be difficult to train over long horizons. In particular, SVG(1), not presented in the plots, is very unstable in our experiments. |
| Researcher Affiliation | Academia | Berkeley AI Research University of California, Berkeley Berkeley, CA 94709 {thanard.kurutach, iclavera, rockyduan, avivt, pabbeel}@berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Vanilla Model-Based Deep Reinforcement Learning and Algorithm 2 Model Ensemble Trust Region Policy Optimization (ME-TRPO). A hedged sketch of the ME-TRPO loop is given after the table. |
| Open Source Code | No | The paper only provides a link to "Videos available at: https://sites.google.com/view/me-trpo" and does not explicitly state that the source code for the methodology is available. |
| Open Datasets | Yes | We evaluate our method and various baselines over six standard continuous control benchmark tasks (Duan et al., 2016; Hesse et al., 2017) in Mujoco (Todorov et al., 2012): Swimmer, Snake, Hopper, Ant, Half Cheetah, and Humanoid, shown in Figure 1. The details of the tasks can be found in Appendix A.2. and The environments we use are adopted from rllab (Duan et al., 2016). |
| Dataset Splits | Yes | Finally, we split the collected data using a 2-to-1 ratio for training and validation datasets. (An illustrative split snippet follows the table.) |
| Hardware Specification | Yes | These experiments were performed on Amazon EC2 using 1 NVIDIA K80 GPU, 4 vCPUs, and 61 GB of memory. |
| Software Dependencies | No | We use the Adam optimizer (Kingma and Ba, 2014) to solve this supervised learning problem. and We evaluated using Vanilla Policy Gradient (VPG) (Peters and Schaal, 2006), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015). and in Mujoco (Todorov et al., 2012) and The environments we use are adopted from rllab (Duan et al., 2016). |
| Experiment Setup | Yes | We represent the dynamics model with a 2-hidden-layer feed-forward neural network with hidden sizes 1024-1024 and ReLU nonlinearities. We train the model with the Adam optimizer with learning rate 0.001 using a batch size of 1000. and We represent the policy with a 2-hidden-layer feed-forward neural network with hidden sizes 32-32 and tanh nonlinearities for all the environments, except Humanoid, in which we use the hidden sizes 100-50-25. The policy is trained with TRPO on the learned models using initial standard deviation 1.0, step size δ_KL 0.01, and batch size 50000. An illustrative configuration sketch follows the table. |
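
The Pseudocode row above cites Algorithm 2 (ME-TRPO), which alternates between collecting real data, fitting an ensemble of dynamics models, and optimizing the policy with TRPO against the learned ensemble until it stops improving on enough models. The following is a minimal Python sketch of that outer loop; the callables it takes (`collect_rollouts`, `fit_model`, `trpo_update`, `estimate_return`) and the 70% improvement threshold are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of the ME-TRPO outer loop (Algorithm 2 in the paper).
# The injected callables and the improvement threshold are illustrative
# placeholders, not the authors' code.

def me_trpo(policy, models, env,
            collect_rollouts, fit_model, trpo_update, estimate_return,
            outer_iters=100, policy_iters=50, improve_threshold=0.7):
    data = []
    for _ in range(outer_iters):
        # 1. Gather real transitions with the current stochastic policy.
        data += collect_rollouts(env, policy)

        # 2. Re-fit every model in the ensemble on all real data collected
        #    so far (each model keeps its own initialization / data order).
        for model in models:
            fit_model(model, data)

        # 3. Improve the policy with TRPO on rollouts simulated by the
        #    ensemble; stop once the policy improves on too few models.
        returns = [estimate_return(policy, m) for m in models]
        for _ in range(policy_iters):
            trpo_update(policy, models)
            new_returns = [estimate_return(policy, m) for m in models]
            improved = sum(n > r for n, r in zip(new_returns, returns))
            if improved / len(models) < improve_threshold:  # e.g. 70%
                break
            returns = new_returns
    return policy
```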
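
The Dataset Splits row reports a 2-to-1 ratio between training and validation data. The snippet below is a generic illustration of such a split (shuffling transition indices and slicing at two-thirds); it is not taken from the paper's code.

```python
import numpy as np

def split_two_to_one(transitions, seed=0):
    """Shuffle transitions and keep 2/3 for training, 1/3 for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    cut = (2 * len(transitions)) // 3
    train = [transitions[i] for i in idx[:cut]]
    valid = [transitions[i] for i in idx[cut:]]
    return train, valid
```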
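
For the Experiment Setup row, the reported architectures and optimizer settings map onto a few lines of framework code. The sketch below uses PyTorch purely for illustration (the paper does not state that this library was used), and the observation/action dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # placeholder dimensions; task-dependent

# Dynamics model: 2 hidden layers of 1024 units with ReLU, trained with
# Adam at learning rate 0.001 and batch size 1000 (as reported).
dynamics_model = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, obs_dim),
)
model_optimizer = torch.optim.Adam(dynamics_model.parameters(), lr=1e-3)

# Policy mean network: 2 hidden layers of 32 units with tanh (100-50-25 for
# Humanoid). The policy is optimized with TRPO (step size delta_KL = 0.01,
# batch size 50000), which is not shown here.
policy_mean = nn.Sequential(
    nn.Linear(obs_dim, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, act_dim),
)
log_std = nn.Parameter(torch.zeros(act_dim))  # initial standard deviation 1.0
```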