EvIL: Evolution Strategies for Generalisable Imitation Learning
Authors: Silvia Sapora, Gokul Swamy, Chris Lu, Yee Whye Teh, Jakob Nicolaus Foerster
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experimental evaluation of our proposed method across a suite of continuous control tasks and find that it leads to significantly more efficient and effective retraining in source and target environments than prior work. |
| Researcher Affiliation | Academia | University of Oxford, UK; Carnegie Mellon University, USA. |
| Pseudocode | Yes | Algorithm 1 ('Reward Shaping with Evolution Strategies') and Algorithm 2 ('EvIL: Evolution Strategies for Generalisable Imitation') are provided in the paper. |
| Open Source Code | No | The paper states: “All our code is implemented in JAX (Bradbury et al., 2018) using the PureJaxRL (Lu et al., 2022a), Brax (Freeman et al., 2021), and evosax (Lange, 2023) libraries to maximise parallelisation of training.” This lists third-party libraries used, but does not provide a specific link or an explicit statement that the code for the authors' method (EvIL) is open-source or publicly available. |
| Open Datasets | Yes | We conduct our experiments across three distinct MuJoCo environments: Hopper, Walker, and Ant. All learners receive 100 trajectories from the expert policy, trained using Proximal Policy Optimisation (PPO) (Schulman et al., 2017) over 5e7 timesteps. |
| Dataset Splits | No | The paper discusses concepts of training and testing environments for transfer learning but does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, or testing data within an environment. |
| Hardware Specification | No | The paper mentions “SS was supported by Google TPU Research Cloud (TRC) and Google Cloud Research Credits program” in the acknowledgements, indicating the use of TPUs, but it does not specify the exact TPU model (e.g., TPU v2, v3) or any other specific hardware components like GPU or CPU models. |
| Software Dependencies | No | The paper states: “All our code is implemented in JAX (Bradbury et al., 2018) using the PureJaxRL (Lu et al., 2022a), Brax (Freeman et al., 2021), and evosax (Lange, 2023) libraries to maximise parallelisation of training.” While it lists software names, it does not provide specific version numbers for any of these dependencies. |
| Experiment Setup | Yes | All learners receive 100 trajectories from the expert policy, trained using Proximal Policy Optimisation (PPO) (Schulman et al., 2017) over 5e7 timesteps. Appendix C provides detailed hyperparameters in Table 2 ('Hyperparameters for Training IRL') and Table 3 ('Important parameters for Training Reward Shaping with ES'), including values for 'Number of Reward Hidden Layers', 'Size of Reward Hidden Layer', 'Inner Loop Learning Rate', and 'Population Size' (an illustrative ES outer-loop sketch follows this table). |
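
The 'Experiment Setup' row above refers to ES training parameters such as a population size and an inner-loop learning rate. For context, below is a minimal, hedged sketch of an OpenAI-ES-style outer loop written in plain JAX. The fitness function, parameter dimensionality, and all hyperparameter values here are assumptions for illustration only; they are not the paper's actual EvIL implementation, which scores reward parameters by retraining an RL learner in an inner loop and uses the evosax library.

```python
# Illustrative OpenAI-ES-style outer loop in plain JAX (not the paper's code).
# All constants and the toy fitness function below are assumptions.
import jax
import jax.numpy as jnp

POP_SIZE = 64          # assumed population size; the paper's value is in its Table 3
SIGMA = 0.03           # perturbation scale (assumed)
LEARNING_RATE = 0.01   # ES outer-loop learning rate (assumed)
NUM_GENERATIONS = 100  # number of ES generations (assumed)

def shaping_fitness(theta: jnp.ndarray) -> jnp.ndarray:
    """Placeholder fitness. In EvIL this would measure how well/quickly an
    inner-loop RL learner retrains under the shaped reward parameterised by
    theta; here we use a toy quadratic so the sketch runs end to end."""
    return -jnp.sum(theta ** 2)

@jax.jit
def es_step(theta, key):
    # Antithetic sampling: evaluate +eps and -eps perturbations of theta.
    key, sub = jax.random.split(key)
    eps = jax.random.normal(sub, (POP_SIZE // 2, theta.shape[0]))
    eps = jnp.concatenate([eps, -eps], axis=0)
    fitness = jax.vmap(shaping_fitness)(theta + SIGMA * eps)
    # Standardise fitness and form the ES gradient estimate
    # grad ≈ (1 / (N * sigma)) * sum_i f_i * eps_i.
    f = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad_est = (f[:, None] * eps).mean(axis=0) / SIGMA
    return theta + LEARNING_RATE * grad_est, key

theta = jnp.zeros(16)              # flattened reward-network parameters (toy size)
key = jax.random.PRNGKey(0)
for _ in range(NUM_GENERATIONS):
    theta, key = es_step(theta, key)
print("final fitness:", shaping_fitness(theta))
```

In the paper's setting, each fitness evaluation would involve an inner-loop policy optimisation under the candidate shaped reward, which is why the authors emphasise JAX/Brax-based parallelisation of training.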