EPOpt: Learning Robust Neural Network Policies Using Model Ensembles
Authors: Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, Sergey Levine
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed methods on the hopper (12-dimensional state space; 3-dimensional action space) and half-cheetah (18-dimensional state space; 6-dimensional action space) benchmarks in MuJoCo. Our experimental results suggest that adversarial training on model ensembles produces robust policies which generalize better than policies trained on a single, maximum-likelihood model (of source distribution) alone. |
| Researcher Affiliation | Academia | 1. University of Washington, Seattle; 2. NITK Surathkal; 3. Indian Institute of Technology Madras; 4. University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: EPOpt-ϵ for Robust Policy Search (a minimal sketch of this loop is given after the table) |
| Open Source Code | Yes | Our implementation of the algorithms and environments are public in this repository to facilitate reproduction of results: https://github.com/aravindr93/robustRL |
| Open Datasets | Yes | We evaluated the proposed EPOpt-ϵ algorithm on the 2D hopper (Erez et al., 2011) and half-cheetah (Wawrzynski, 2009) benchmarks using the MuJoCo physics simulator (Todorov et al., 2012). ... For both tasks, we used the standard reward functions implemented with OpenAI gym (Brockman et al., 2016), with minor modifications. |
| Dataset Splits | No | The paper describes how data is sampled from the simulated environment (e.g., “sample a trajectory τ_k = {s_t, a_t, r_t, s_{t+1}}_{t=0}^{T−1} from M(p_k) using policy π(θ_i)”) for training and evaluation. However, it does not specify a conventional training/validation/test split for a pre-existing dataset, as data is generated dynamically through simulation. |
| Hardware Specification | No | The paper states: “This was implemented in parallel on multiple (6) CPUs.” (Appendix A.2). This gives the number of processors, but lacks specifics such as the CPU model (e.g., Intel Xeon, AMD Ryzen), clock speed, memory, or whether GPUs were used, which are needed for full reproducibility. |
| Software Dependencies | No | The paper mentions several software components, such as the “MuJoCo physics simulator (Todorov et al., 2012)”, “TRPO” for batch policy optimization, and “OpenAI gym (Brockman et al., 2016)”. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The policy is represented with a Gaussian distribution, the mean of which is parametrized using a neural network with two hidden layers. Each hidden layer has 64 units, with a tanh non-linearity, and the final output layer is made of linear units. ... The maximum KL divergence between successive policy updates is constrained to be 0.01 ... In each iteration, we sample N = 240 models from the ensemble, and one rollout is performed on each such model. ... Each trajectory is of length 1000 ... The results in Fig 1 and Fig 2 were generated after 150 and 200 iterations of TRPO respectively ... The first 100 iterations use ϵ = 1, and the final 100 iterations use the desired ϵ. (A sketch of this policy parameterization is given after the table.) |
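
To make the “Pseudocode” row concrete, the following is a minimal Python sketch of the EPOpt-ϵ loop described in Algorithm 1. The `sample_model_params`, `rollout`, and `policy_update` callables are placeholders standing in for the paper's source-distribution sampler, MuJoCo rollouts, and TRPO batch policy optimization; they are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def epopt_epsilon(sample_model_params, rollout, policy_update, theta,
                  num_models=240, epsilon=0.1, horizon=1000, n_iters=100):
    """Minimal sketch of EPOpt-epsilon (Algorithm 1 in the paper).

    sample_model_params() -> model parameters p_k drawn from the source distribution.
    rollout(p_k, theta, horizon) -> (trajectory, total_return) on the model M(p_k).
    policy_update(theta, trajectories) -> updated theta via batch policy optimization
        (TRPO in the paper).
    All three callables are placeholders, not the authors' code.
    """
    for _ in range(n_iters):
        # Sample N model instances and collect one rollout from each.
        samples = [rollout(sample_model_params(), theta, horizon)
                   for _ in range(num_models)]
        returns = np.array([ret for _, ret in samples])

        # Keep only the worst epsilon-fraction of trajectories
        # (the epsilon-percentile subset used for the adversarial update).
        threshold = np.percentile(returns, 100.0 * epsilon)
        worst = [traj for traj, ret in samples if ret <= threshold]

        # Update the policy using only the retained worst-case trajectories.
        theta = policy_update(theta, worst)
    return theta
```

As quoted in the “Experiment Setup” row, the paper runs the first 100 iterations with ϵ = 1 (all trajectories are used) before switching to the desired ϵ.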
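
The “Experiment Setup” row describes the policy as a Gaussian whose mean is produced by a neural network with two 64-unit tanh hidden layers and a linear output layer. Below is a minimal sketch of that parameterization; the choice of PyTorch, the state-independent log-standard-deviation, and its zero initialization are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Gaussian policy with a 2x64 tanh MLP mean, matching the layer sizes quoted
    in the Experiment Setup row. The state-independent log-std and its zero
    initialization are assumptions of this sketch."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),  # linear output units
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Example dimensions from the paper: hopper has a 12-D state and 3-D action space.
policy = GaussianMLPPolicy(obs_dim=12, act_dim=3)
action = policy(torch.zeros(12)).sample()
```

Per the setup quoted above, such a policy would be updated with TRPO under a KL constraint of 0.01, using N = 240 rollouts of length 1000 per iteration.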