Evolving Curricula with Regret-Based Environment Design
Authors: Jack Parker-Holder, Minqi Jiang, Michael Dennis, Mikayel Samvelyan, Jakob Foerster, Edward Grefenstette, Tim Rocktäschel
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach, which we call Adversarially Compounding Complexity by Editing Levels (ACCEL), seeks to constantly produce levels at the frontier of the student agent's capabilities, resulting in curricula that start simple but become increasingly complex. ACCEL maintains the theoretical benefits of prior regret-based methods, while providing significant empirical gains in a diverse set of environments. An interactive version of this paper is available at https://accelagent.github.io. |
| Researcher Affiliation | Collaboration | *Equal contribution. 1Meta AI, 2University of Oxford, 3UCL, 4UC Berkeley. Correspondence to: Jack Parker-Holder <jackph@robots.ox.ac.uk>, Minqi Jiang <msj@fb.com>. |
| Pseudocode | Yes | The full procedure is shown in Algorithm 1. ACCEL can be seen as a UED algorithm taking a step toward open-ended evolution (Stanley et al., 2017), where the evolutionary fitness is estimated regret, as levels only stay in the population (that is, the level replay buffer) if they meet the high-regret criterion for curation. |
| Open Source Code | Yes | An open source implementation of ACCEL reproducing our experiments is available at https://github.com/facebookresearch/dcd. |
| Open Datasets | Yes | We begin with a partially-observable navigation environment, where we test our agent's transfer capabilities on human-designed levels. For ACCEL we begin with empty rooms and randomly edit the block locations (by adding or removing blocks), as well as the goal location. The MiniHack environment is an open-source Gym environment (Brockman et al., 2016), which wraps the game of NetHack via the NetHack Learning Environment (Küttler et al., 2020). |
| Dataset Splits | Yes | For MiniGrid, we follow the protocol from Jiang et al. (2021a) and select the best hyperparameters using the validation levels {16Rooms, Labyrinth, Maze}. The final hyperparameters chosen are shown in Table 11. We tuned the hyperparameters for our base agent using domain randomization, and conducted a sweep over the learning rate {3e-4, 3e-5}, PPO epochs {5, 20}, entropy coefficient {0, 1e-3} and number of minibatches {4, 32}, using the validation performance on BipedalWalkerHardcore. |
| Hardware Specification | Yes | All training runs used a single V100 GPU, using 10 Intel Xeon E5-2698 v4 CPUs. |
| Software Dependencies | No | The paper mentions using Python, Proximal Policy Optimization (PPO), the Adam optimizer, MiniGrid, MiniHack, and a modified BipedalWalker environment. However, specific version numbers for these software components or libraries are not provided. |
| Experiment Setup | Yes | For a full list of hyperparameters for each experiment please see Table 11 in Section C.3. Table 11 provides detailed hyperparameters for PPO (e.g., PPO rollout length 256, PPO epochs 5, Adam learning rate 1e-4) and ACCEL/PLR specific settings (e.g., Buffer size 10000, Replay rate 0.9, Number of edits 5). |
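
The Pseudocode and Open Datasets rows above describe ACCEL's core loop: either replay a high-regret level from the level replay buffer and train on it, then propose children by small random edits (adding/removing blocks, moving the goal), or evaluate a freshly generated simple level; levels stay in the buffer only while their estimated regret remains high. The sketch below illustrates that loop under stated assumptions: the `Level` encoding, the `edit_level` operator, and the `estimate_regret` and `train_on` callables are hypothetical stand-ins for exposition, not the authors' implementation (see https://github.com/facebookresearch/dcd for the official code).

```python
import random
from dataclasses import dataclass, field

# Illustrative sketch only: level encoding, the edit operator, and the
# regret estimate are placeholder assumptions, not the authors' code.

@dataclass
class Level:
    blocks: set = field(default_factory=set)  # occupied (x, y) cells
    goal: tuple = (5, 5)
    score: float = 0.0                         # estimated regret

def edit_level(level, size=15, num_edits=5):
    """Randomly add/remove blocks or move the goal, mirroring the edits
    described for the MiniGrid navigation setting in the row above."""
    child = Level(blocks=set(level.blocks), goal=level.goal)
    for _ in range(num_edits):
        if random.random() < 0.8:
            cell = (random.randrange(size), random.randrange(size))
            child.blocks.symmetric_difference_update({cell})  # toggle a block
        else:
            child.goal = (random.randrange(size), random.randrange(size))
    return child

def accel_step(buffer, estimate_regret, train_on,
               replay_rate=0.9, buffer_size=10_000, num_edits=5):
    """One curriculum iteration: replay-and-edit a high-regret level,
    or evaluate a fresh simple level; curate the buffer by estimated regret."""
    if buffer and random.random() < replay_rate:
        level = max(buffer, key=lambda l: l.score)  # stand-in for PLR's rank-based sampling
        train_on(level)                             # the student trains only on replayed levels
        candidates = [edit_level(level, num_edits=num_edits)]
    else:
        candidates = [Level()]                      # start from an empty room
    for cand in candidates:
        cand.score = estimate_regret(cand)          # evaluate without training
        buffer.append(cand)
    # Keep only the highest-regret levels in the population.
    buffer.sort(key=lambda l: l.score, reverse=True)
    del buffer[buffer_size:]
```

With a dummy `estimate_regret` (e.g., random scores) and a no-op `train_on`, repeatedly calling `accel_step` on an initially empty list grows a population of progressively edited levels, which is the compounding-complexity behaviour the paper describes.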
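The Dataset Splits and Experiment Setup rows quote a handful of concrete settings from the paper's Table 11 and the tuning sweep for the BipedalWalker base agent. Collected in one place for convenience (the dictionary keys are assumed names; only the numbers quoted above are included, and all other Table 11 settings are omitted):

```python
# Values reproduced from the rows above (paper's Table 11 and the
# BipedalWalkerHardcore tuning sweep); key names are illustrative.
ppo_config = {
    "rollout_length": 256,
    "ppo_epochs": 5,
    "learning_rate": 1e-4,   # Adam
}
accel_plr_config = {
    "level_buffer_size": 10_000,
    "replay_rate": 0.9,
    "num_edits": 5,
}
# Sweep for the domain-randomization base agent, selected by
# validation performance on BipedalWalkerHardcore:
sweep = {
    "learning_rate": [3e-4, 3e-5],
    "ppo_epochs": [5, 20],
    "entropy_coef": [0.0, 1e-3],
    "num_minibatches": [4, 32],
}
```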