Model-Ensemble Trust-Region Policy Optimization

Authors: Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We design the experiments to answer the following questions: 1. How does our approach compare against state-of-the-art methods in terms of sample complexity and final performance? 2. What are the failure scenarios of the vanilla algorithm? 3. How does our method overcome these failures?" and "The results are shown in Figure 2. Prior model-based methods appear to achieve worse performance compared with model-free methods. In addition, we find that model-based methods tend to be difficult to train over long horizons. In particular, SVG(1), not presented in the plots, is very unstable in our experiments."
Researcher Affiliation | Academia | "Berkeley AI Research, University of California, Berkeley, Berkeley, CA 94709, {thanard.kurutach, iclavera, rockyduan, avivt, pabbeel}@berkeley.edu"
Pseudocode | Yes | "Algorithm 1 Vanilla Model-Based Deep Reinforcement Learning" and "Algorithm 2 Model Ensemble Trust Region Policy Optimization (ME-TRPO)" (a hedged sketch of the ME-TRPO loop appears below the table)
Open Source Code | No | The paper only provides a link ("Videos available at: https://sites.google.com/view/me-trpo") and does not explicitly state that the source code for the methodology is available.
Open Datasets | Yes | "We evaluate our method and various baselines over six standard continuous control benchmark tasks (Duan et al., 2016; Hesse et al., 2017) in MuJoCo (Todorov et al., 2012): Swimmer, Snake, Hopper, Ant, Half Cheetah, and Humanoid, shown in Figure 1. The details of the tasks can be found in Appendix A.2." and "The environments we use are adopted from rllab (Duan et al., 2016)."
Dataset Splits | Yes | "Finally, we split the collected data using a 2-to-1 ratio for training and validation datasets." (see the split helper sketched below the table)
Hardware Specification | Yes | "These experiments were performed on Amazon EC2 using 1 NVIDIA K80 GPU, 4 vCPUs, and 61 GB of memory."
Software Dependencies | No | "We use the Adam optimizer (Kingma and Ba, 2014) to solve this supervised learning problem." and "We evaluated using Vanilla Policy Gradient (VPG) (Peters and Schaal, 2006), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015)." and "in MuJoCo (Todorov et al., 2012)" and "The environments we use are adopted from rllab (Duan et al., 2016)."
Experiment Setup | Yes | "We represent the dynamics model with a 2-hidden-layer feed-forward neural network with hidden sizes 1024-1024 and ReLU nonlinearities. We train the model with the Adam optimizer with learning rate 0.001 using a batch size of 1000." and "We represent the policy with a 2-hidden-layer feed-forward neural network with hidden sizes 32-32 and tanh nonlinearities for all the environments, except Humanoid, in which we use the hidden sizes 100-50-25. The policy is trained with TRPO on the learned models using initial standard deviation 1.0, step size δKL = 0.01, and batch size 50000." (see the network sketch below the table)
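
To make the Pseudocode entry concrete, here is a minimal Python sketch of the ME-TRPO outer loop described in Algorithm 2. The helpers `collect_rollouts`, `train_model`, `trpo_update`, and `policy_improves_on`, the `predict` method on each model, and the 0.7 threshold are assumptions standing in for the paper's data collection, supervised model fitting, TRPO step, and validation check; only the overall loop structure follows the paper.

```python
import random


def me_trpo(env, policy, models,
            collect_rollouts, train_model, trpo_update, policy_improves_on,
            n_outer_iters=100, n_policy_iters=50, improve_threshold=0.7):
    """Hedged sketch of the ME-TRPO outer loop (not the authors' code)."""
    dataset = []
    for _ in range(n_outer_iters):
        # 1. Gather real-environment transitions with the current policy.
        dataset += collect_rollouts(env, policy)

        # 2. Fit every dynamics model in the ensemble on the aggregated data
        #    (each model uses its own shuffled train/validation split).
        for model in models:
            train_model(model, dataset)

        # 3. Improve the policy on imagined rollouts, sampling a model from
        #    the ensemble at every simulated step.
        def step_model(state, action):
            # Assumed `predict(state, action) -> next_state` interface.
            return random.choice(models).predict(state, action)

        for _ in range(n_policy_iters):
            trpo_update(policy, step_model)
            # Early stopping: keep optimizing only while the policy still
            # improves on a large enough fraction of the ensemble
            # (threshold value is an assumption here).
            improved = sum(policy_improves_on(policy, m) for m in models)
            if improved / len(models) < improve_threshold:
                break
    return policy
```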
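
The 2-to-1 split quoted in the Dataset Splits row could be implemented along these lines; `split_two_to_one` is an illustrative helper, not the authors' code.

```python
import numpy as np


def split_two_to_one(transitions, seed=0):
    """Shuffle collected transitions and split them 2:1 into train/validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    cut = (2 * len(transitions)) // 3  # two thirds for training
    train = [transitions[i] for i in idx[:cut]]
    valid = [transitions[i] for i in idx[cut:]]
    return train, valid
```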
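
Finally, a hedged sketch of the network sizes quoted in the Experiment Setup row, written in PyTorch for concreteness (the original implementation builds on rllab/TensorFlow). The observation and action dimensions are placeholders, and the TRPO update itself is not shown.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # placeholder dimensions; task-dependent

# Dynamics model: 2 hidden layers of 1024 units, ReLU, trained with Adam(lr=1e-3).
dynamics_model = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, obs_dim),  # predicts the next state (or a state delta)
)
model_optimizer = torch.optim.Adam(dynamics_model.parameters(), lr=1e-3)

# Policy mean network: 2 hidden layers of 32 units with tanh (100-50-25 for
# Humanoid). The policy is Gaussian with initial standard deviation 1.0 and is
# updated with TRPO (step size delta_KL = 0.01, batch size 50000), not shown.
policy_mean = nn.Sequential(
    nn.Linear(obs_dim, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, act_dim),
)
log_std = nn.Parameter(torch.zeros(act_dim))  # exp(0) = 1.0 initial std
```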