EvIL: Evolution Strategies for Generalisable Imitation Learning
Authors: Silvia Sapora, Gokul Swamy, Chris Lu, Yee Whye Teh, Jakob Nicolaus Foerster
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experimental evaluation of our proposed method across a suite of continuous control tasks and find that it leads to significantly more efficient and effective retraining in source and target environments than prior work. |
| Researcher Affiliation | Academia | University of Oxford, UK; Carnegie Mellon University, USA. |
| Pseudocode | Yes | Algorithm 1 ('Reward Shaping with Evolution Strategies') and Algorithm 2 ('EvIL: Evolution Strategies for Generalisable Imitation') are provided in the paper. |
| Open Source Code | No | The paper states: “All our code is implemented in JAX (Bradbury et al., 2018) using the PureJaxRL (Lu et al., 2022a), Brax (Freeman et al., 2021), and evosax (Lange, 2023) libraries to maximise parallelisation of training.” This lists third-party libraries used, but does not provide a specific link or an explicit statement that the code for the authors' method (EvIL) is open-source or publicly available. |
| Open Datasets | Yes | We conduct our experiments across three distinct MuJoCo environments: Hopper, Walker, and Ant. All learners receive 100 trajectories from the expert policy, trained using Proximal Policy Optimisation (PPO) (Schulman et al., 2017) over 5e7 timesteps. |
| Dataset Splits | No | The paper discusses concepts of training and testing environments for transfer learning but does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, or testing data within an environment. |
| Hardware Specification | No | The paper mentions “SS was supported by Google TPU Research Cloud (TRC) and Google Cloud Research Credits program” in the acknowledgements, indicating the use of TPUs, but it does not specify the exact TPU model (e.g., TPU v2, v3) or any other specific hardware components like GPU or CPU models. |
| Software Dependencies | No | The paper states: “All our code is implemented in JAX (Bradbury et al., 2018) using the PureJaxRL (Lu et al., 2022a), Brax (Freeman et al., 2021), and evosax (Lange, 2023) libraries to maximise parallelisation of training.” While it lists software names, it does not provide specific version numbers for any of these dependencies. |
| Experiment Setup | Yes | All learners receive 100 trajectories from the expert policy, trained using Proximal Policy Optimisation (PPO) (Schulman et al., 2017) over 5e7 timesteps. Appendix C provides detailed hyperparameters in Table 2 ('Hyperparameters for Training IRL') and Table 3 ('Important parameters for Training Reward Shaping with ES'), including values for 'Number of Reward Hidden Layers', 'Size of Reward Hidden Layer', 'Inner Loop Learning Rate', and 'Population Size' (an illustrative ES outer-loop sketch follows this table). |
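
The 'Experiment Setup' row above refers to ES training parameters such as a population size and an inner-loop learning rate. For context, below is a minimal, hedged sketch of an OpenAI-ES-style outer loop written in plain JAX. The fitness function, parameter dimensionality, and all hyperparameter values here are assumptions for illustration only; they are not the paper's actual EvIL implementation, which scores reward parameters by retraining an RL learner in an inner loop and uses the evosax library.

```python
# Illustrative OpenAI-ES-style outer loop in plain JAX (not the paper's code).
# All constants and the toy fitness function below are assumptions.
import jax
import jax.numpy as jnp

POP_SIZE = 64          # assumed population size; the paper's value is in its Table 3
SIGMA = 0.03           # perturbation scale (assumed)
LEARNING_RATE = 0.01   # ES outer-loop learning rate (assumed)
NUM_GENERATIONS = 100  # number of ES generations (assumed)

def shaping_fitness(theta: jnp.ndarray) -> jnp.ndarray:
    """Placeholder fitness. In EvIL this would measure how well/quickly an
    inner-loop RL learner retrains under the shaped reward parameterised by
    theta; here we use a toy quadratic so the sketch runs end to end."""
    return -jnp.sum(theta ** 2)

@jax.jit
def es_step(theta, key):
    # Antithetic sampling: evaluate +eps and -eps perturbations of theta.
    key, sub = jax.random.split(key)
    eps = jax.random.normal(sub, (POP_SIZE // 2, theta.shape[0]))
    eps = jnp.concatenate([eps, -eps], axis=0)
    fitness = jax.vmap(shaping_fitness)(theta + SIGMA * eps)
    # Standardise fitness and form the ES gradient estimate
    # grad ≈ (1 / (N * sigma)) * sum_i f_i * eps_i.
    f = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad_est = (f[:, None] * eps).mean(axis=0) / SIGMA
    return theta + LEARNING_RATE * grad_est, key

theta = jnp.zeros(16)              # flattened reward-network parameters (toy size)
key = jax.random.PRNGKey(0)
for _ in range(NUM_GENERATIONS):
    theta, key = es_step(theta, key)
print("final fitness:", shaping_fitness(theta))
```

In the paper's setting, each fitness evaluation would involve an inner-loop policy optimisation under the candidate shaped reward, which is why the authors emphasise JAX/Brax-based parallelisation of training.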