EPOpt: Learning Robust Neural Network Policies Using Model Ensembles
Authors: Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, Sergey Levine
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed methods on the hopper (12-dimensional state space; 3-dimensional action space) and half-cheetah (18-dimensional state space; 6-dimensional action space) benchmarks in MuJoCo. Our experimental results suggest that adversarial training on model ensembles produces robust policies which generalize better than policies trained on a single, maximum-likelihood model (of source distribution) alone. |
| Researcher Affiliation | Academia | 1. University of Washington, Seattle; 2. NITK Surathkal; 3. Indian Institute of Technology Madras; 4. University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: EPOpt-ϵ for Robust Policy Search (a minimal sketch of this loop is given after the table) |
| Open Source Code | Yes | Our implementation of the algorithms and environments are public in this repository to facilitate reproduction of results: https://github.com/aravindr93/robustRL |
| Open Datasets | Yes | We evaluated the proposed EPOpt-ϵ algorithm on the 2D hopper (Erez et al., 2011) and half-cheetah (Wawrzynski, 2009) benchmarks using the MuJoCo physics simulator (Todorov et al., 2012). ... For both tasks, we used the standard reward functions implemented with OpenAI gym (Brockman et al., 2016), with minor modifications. |
| Dataset Splits | No | The paper describes how data is sampled from the simulated environment (e.g., “sample a trajectory τ_k = {s_t, a_t, r_t, s_{t+1}}_{t=0}^{T−1} from M(p_k) using policy π(θ_i)”) for training and evaluation. However, it does not specify a conventional training/validation/test split for a pre-existing dataset, as data is generated dynamically through simulation. |
| Hardware Specification | No | The paper states: “This was implemented in parallel on multiple (6) CPUs.” (Appendix A.2). This gives the number of processors, but lacks specifics such as the CPU model (e.g., Intel Xeon, AMD Ryzen), clock speed, memory, or whether GPUs were used, which are needed for full reproducibility. |
| Software Dependencies | No | The paper mentions several software components, such as the “MuJoCo physics simulator (Todorov et al., 2012)”, “TRPO” for batch policy optimization, and “OpenAI gym (Brockman et al., 2016)”. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The policy is represented with a Gaussian distribution, the mean of which is parametrized using a neural network with two hidden layers. Each hidden layer has 64 units, with a tanh non-linearity, and the final output layer is made of linear units. ... The maximum KL divergence between successive policy updates is constrained to be 0.01 ... In each iteration, we sample N = 240 models from the ensemble, and one rollout is performed on each such model. ... Each trajectory is of length 1000 ... The results in Fig 1 and Fig 2 were generated after 150 and 200 iterations of TRPO respectively ... The first 100 iterations use ϵ = 1, and the final 100 iterations use the desired ϵ. (A sketch of this policy parameterization is given after the table.) |
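
To make the “Pseudocode” row concrete, the following is a minimal Python sketch of the EPOpt-ϵ loop described in Algorithm 1. The `sample_model_params`, `rollout`, and `policy_update` callables are placeholders standing in for the paper's source-distribution sampler, MuJoCo rollouts, and TRPO batch policy optimization; they are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def epopt_epsilon(sample_model_params, rollout, policy_update, theta,
                  num_models=240, epsilon=0.1, horizon=1000, n_iters=100):
    """Minimal sketch of EPOpt-epsilon (Algorithm 1 in the paper).

    sample_model_params() -> model parameters p_k drawn from the source distribution.
    rollout(p_k, theta, horizon) -> (trajectory, total_return) on the model M(p_k).
    policy_update(theta, trajectories) -> updated theta via batch policy optimization
        (TRPO in the paper).
    All three callables are placeholders, not the authors' code.
    """
    for _ in range(n_iters):
        # Sample N model instances and collect one rollout from each.
        samples = [rollout(sample_model_params(), theta, horizon)
                   for _ in range(num_models)]
        returns = np.array([ret for _, ret in samples])

        # Keep only the worst epsilon-fraction of trajectories
        # (the epsilon-percentile subset used for the adversarial update).
        threshold = np.percentile(returns, 100.0 * epsilon)
        worst = [traj for traj, ret in samples if ret <= threshold]

        # Update the policy using only the retained worst-case trajectories.
        theta = policy_update(theta, worst)
    return theta
```

As quoted in the “Experiment Setup” row, the paper runs the first 100 iterations with ϵ = 1 (all trajectories are used) before switching to the desired ϵ.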
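
The “Experiment Setup” row describes the policy as a Gaussian whose mean is produced by a neural network with two 64-unit tanh hidden layers and a linear output layer. Below is a minimal sketch of that parameterization; the choice of PyTorch, the state-independent log-standard-deviation, and its zero initialization are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Gaussian policy with a 2x64 tanh MLP mean, matching the layer sizes quoted
    in the Experiment Setup row. The state-independent log-std and its zero
    initialization are assumptions of this sketch."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),  # linear output units
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Example dimensions from the paper: hopper has a 12-D state and 3-D action space.
policy = GaussianMLPPolicy(obs_dim=12, act_dim=3)
action = policy(torch.zeros(12)).sample()
```

Per the setup quoted above, such a policy would be updated with TRPO under a KL constraint of 0.01, using N = 240 rollouts of length 1000 per iteration.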