EvIL: Evolution Strategies for Generalisable Imitation Learning

Authors: Silvia Sapora, Gokul Swamy, Chris Lu, Yee Whye Teh, Jakob Nicolaus Foerster

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experimental evaluation of our proposed method across a suite of continuous control tasks and find that it leads to significantly more efficient and effective retraining in source and target environments than prior work.
Researcher Affiliation | Academia | 1) University of Oxford, UK; 2) Carnegie Mellon University, USA.
Pseudocode | Yes | Algorithm 1 ('Reward Shaping with Evolution Strategies') and Algorithm 2 ('EvIL: Evolution Strategies for Generalisable Imitation') are provided in the paper; a hedged illustrative sketch of the ES reward-shaping pattern appears after this table.
Open Source Code | No | The paper states: “All our code is implemented in JAX (Bradbury et al., 2018) using the PureJaxRL (Lu et al., 2022a), Brax (Freeman et al., 2021), and evosax (Lange, 2023) libraries to maximise parallelisation of training.” This names the third-party libraries used but provides neither a link nor an explicit statement that the authors' EvIL implementation is open-source or publicly available.
Open Datasets | Yes | We conduct our experiments across three distinct MuJoCo environments: Hopper, Walker, and Ant. All learners receive 100 trajectories from the expert policy, trained using Proximal Policy Optimisation (PPO) (Schulman et al., 2017) over 5e7 timesteps.
Dataset Splits | No | The paper discusses training and testing environments for transfer learning but does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, or testing data within an environment.
Hardware Specification | No | The paper mentions “SS was supported by Google TPU Research Cloud (TRC) and Google Cloud Research Credits program” in the acknowledgements, indicating the use of TPUs, but it does not specify the exact TPU model (e.g., TPU v2, v3) or any other specific hardware components like GPU or CPU models.
Software Dependencies | No | The paper states: “All our code is implemented in JAX (Bradbury et al., 2018) using the PureJaxRL (Lu et al., 2022a), Brax (Freeman et al., 2021), and evosax (Lange, 2023) libraries to maximise parallelisation of training.” While it lists software names, it does not provide specific version numbers for any of these dependencies.
Experiment Setup | Yes | All learners receive 100 trajectories from the expert policy, trained using Proximal Policy Optimisation (PPO) (Schulman et al., 2017) over 5e7 timesteps. Appendix C provides detailed hyperparameters in Table 2 ('Hyperparameters for Training IRL') and Table 3 ('Important parameters for Training Reward Shaping with ES'), including values for 'Number of Reward Hidden Layers', 'Size of Reward Hidden Layer', 'Inner Loop Learning Rate', and 'Population Size'.
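
The Experiment Setup row above names the Table 3 hyperparameters without reproducing their values. As a purely illustrative aid, the dataclass below sketches how such a configuration could be organised; the field names mirror the parameter names quoted from the paper, while every default value is a placeholder and not a value reported in the paper.

# Hedged sketch of a configuration record mirroring the Table 3 parameter names.
# All defaults are placeholders, not values reported in the paper.
from dataclasses import dataclass

@dataclass
class RewardShapingESConfig:
    num_reward_hidden_layers: int = 2         # 'Number of Reward Hidden Layers'
    reward_hidden_layer_size: int = 64        # 'Size of Reward Hidden Layer'
    inner_loop_learning_rate: float = 3e-4    # 'Inner Loop Learning Rate'
    population_size: int = 64                 # 'Population Size'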
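
As referenced in the Pseudocode row, Algorithm 1 performs reward shaping with evolution strategies. The snippet below is a minimal, library-agnostic sketch of that general pattern in JAX: an OpenAI-style antithetic ES update applied to the parameters of a small reward MLP. It is not the authors' implementation; the network sizes, ES hyperparameters, and the dummy fitness function are assumptions made only for illustration. In EvIL's actual setting the fitness would involve retraining a policy under the shaped reward, and the paper's stack (evosax for the ES strategy, Brax for environments, PureJaxRL for policy training) would replace this hand-rolled update.

# Minimal, library-agnostic sketch of OpenAI-style ES applied to a reward MLP.
# Hypothetical illustration only: sizes, step sizes, and the fitness function
# are placeholders, not values or code from the EvIL paper.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

OBS_DIM = 11      # assumed observation size (e.g. Hopper has 11 dimensions)
HIDDEN = 64       # placeholder for 'Size of Reward Hidden Layer'
POP_SIZE = 32     # placeholder for 'Population Size' (must be even here)
SIGMA = 0.02      # ES perturbation scale (assumption)
LR = 1e-2         # ES step size (assumption)

def init_reward_params(key):
    # One-hidden-layer reward network r_theta(s): observation -> scalar.
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (OBS_DIM, HIDDEN)) / jnp.sqrt(OBS_DIM),
        "b1": jnp.zeros(HIDDEN),
        "w2": jax.random.normal(k2, (HIDDEN, 1)) / jnp.sqrt(HIDDEN),
        "b2": jnp.zeros(1),
    }

def reward_fn(params, obs):
    h = jnp.tanh(obs @ params["w1"] + params["b1"])
    return (h @ params["w2"] + params["b2"]).squeeze(-1)

def fitness(params, key):
    # Placeholder objective. In reward shaping with ES this would instead be
    # the return achieved by a policy retrained under the shaped reward.
    obs = jax.random.normal(key, (128, OBS_DIM))
    return -jnp.mean((reward_fn(params, obs) - 1.0) ** 2)

def es_step(params, key):
    # One antithetic ES update on the flattened reward parameters.
    flat, unravel = ravel_pytree(params)
    k_eps, k_fit = jax.random.split(key)
    eps = jax.random.normal(k_eps, (POP_SIZE // 2, flat.shape[0]))
    eps = jnp.concatenate([eps, -eps])                      # mirrored noise
    candidates = flat[None, :] + SIGMA * eps
    fits = jax.vmap(lambda c, k: fitness(unravel(c), k))(
        candidates, jax.random.split(k_fit, POP_SIZE))
    fits = (fits - fits.mean()) / (fits.std() + 1e-8)       # normalise fitness
    grad_est = eps.T @ fits / (POP_SIZE * SIGMA)            # ES gradient estimate
    return unravel(flat + LR * grad_est)                    # gradient ascent step

key = jax.random.PRNGKey(0)
params = init_reward_params(key)
for _ in range(10):
    key, sub = jax.random.split(key)
    params = es_step(params, sub)

In practice each fitness evaluation would be an inner-loop policy optimisation rather than the dummy objective above, which is why a heavily parallelised JAX stack is attractive for this kind of outer-loop search.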