No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Authors: Alexander Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that this simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard domain of Minigrid and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability. (A short CVaR sketch follows the table.)
Researcher Affiliation | Academia | Alex Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster (University of Oxford)
Pseudocode | Yes | Algorithm 1 (Sampling For Learnability), reproduced below; a runnable sketch of this loop follows the table.
    Initialize: policy πϕ, level buffer D
    while not converged do
        D ← collect_learnable_levels(πϕ)    (using Alg. 2)
        for t = 1, . . . , T do
            Dt ← ρ · NL levels sampled uniformly from D
            Dt ← Dt ∪ (1 − ρ) · NL randomly generated levels
            Collect π's trajectory on Dt and update ϕ
        end for
    end while
Open Source Code | Yes | We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.
Open Datasets | Yes | We use four domains for our experiments: JaxNav in single-agent mode, JaxNav in multi-agent mode, the common UED domain Minigrid [13], and XLand-Minigrid [12]. See Appendix B for more details about the environments.
Dataset Splits | No | The paper describes "hand-designed test sets" and a "randomly sampled set" for evaluation, but it does not explicitly provide train/validation/test splits in the conventional supervised-learning sense, such as percentages or counts for a validation set.
Hardware Specification | Yes | Each individual seed was run on one NVIDIA L40S, using a server with eight NVIDIA L40S GPUs, two AMD EPYC 9554 processors (128 cores in total), and 768 GB of RAM. These times are without logging, and we find that with logging, SFL is around 6% slower than ACCEL on single-agent JaxNav. For multi-agent JaxNav, we compare each method using the same number of PPO updates. The multi-agent results were run on a variety of machines, including the aforementioned L40S system, a similar system featuring NVIDIA A40s, and a workstation containing two RTX 4090s. On a 4090, an SFL run takes 1d 1h 13m 54s while ACCEL takes 18h 17m 26s.
Software Dependencies | No | Recently, Bradbury et al. [18] released JAX, a Python NumPy-like library that allows computations to run natively on accelerators (such as GPUs and TPUs). This has enabled researchers to run experiments that used to take weeks in a few hours [22, 23]. One side effect of this, however, is that current UED libraries are written in JAX, meaning they are primarily compatible with the (relatively small) set of JAX environments. (A minimal JAX example follows the table.)
Experiment Setup | Yes | Table 4 contains the hyperparameters we use, with their selection process for each domain outlined below. We tuned PPO for DR for each domain and then used these same PPO parameters for all methods, tuning only UED-specific parameters.
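
The Research Type row quotes the paper's adversarial evaluation as "closely mirroring the conditional value at risk (CVaR)". As a point of reference only, the sketch below computes the standard lower-tail CVaR of a set of per-level returns, i.e. the mean over the worst alpha-fraction of evaluation levels. The alpha value and the synthetic returns are illustrative assumptions, not the paper's actual evaluation procedure.

```python
# Hedged sketch: standard lower-tail CVaR over per-level returns.
# The alpha and the synthetic returns below are made up for illustration.
import numpy as np


def cvar(returns, alpha=0.1):
    """Mean return over the worst alpha-fraction of levels (lower tail)."""
    returns = np.sort(np.asarray(returns, dtype=float))   # ascending: worst levels first
    k = max(1, int(np.ceil(alpha * len(returns))))         # size of the worst tail
    return returns[:k].mean()


per_level_returns = np.random.default_rng(0).uniform(0.0, 1.0, size=1000)
print(f"mean return: {per_level_returns.mean():.3f}")
print(f"CVaR_0.1:    {cvar(per_level_returns, alpha=0.1):.3f}")
```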
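
The pseudocode in the Pseudocode row is a buffer-refresh / PPO-update loop. The sketch below mirrors only that control flow in plain Python; the level representation, the learnability score (assumed here to be the binary-outcome quantity p·(1−p) for success rate p), and the "PPO update" are toy stand-ins, not the authors' JAX implementation.

```python
# Minimal, self-contained sketch of the Algorithm 1 loop (Sampling For Learnability).
# Only the control flow follows the pseudocode above; everything else is a toy stand-in.
import random

RHO = 0.5        # fraction of each batch drawn from the learnable-level buffer (rho)
N_L = 8          # number of levels per training batch (N_L)
T = 4            # inner PPO updates per buffer refresh
OUTER_ITERS = 3  # stand-in for "while not converged"


def collect_learnable_levels(policy, num_levels=32):
    """Toy stand-in for Alg. 2: keep the candidate levels with the highest learnability."""
    candidates = [random.random() for _ in range(4 * num_levels)]  # a "level" is just a float p
    # Assumed learnability score p * (1 - p), highest for levels solved about half the time.
    candidates.sort(key=lambda p: p * (1.0 - p), reverse=True)
    return candidates[:num_levels]


def generate_random_levels(n):
    """Toy stand-in for domain randomisation: sample n fresh levels."""
    return [random.random() for _ in range(n)]


def rollout_and_ppo_update(policy, levels):
    """Toy stand-in for collecting trajectories on the batch and updating phi."""
    target = sum(levels) / len(levels)
    return policy + 0.1 * (target - policy)


policy = 0.0
for _ in range(OUTER_ITERS):                     # while not converged do
    buffer = collect_learnable_levels(policy)    # D <- collect_learnable_levels(pi_phi)
    for _ in range(T):                           # for t = 1, ..., T do
        n_buf = int(RHO * N_L)
        batch = random.sample(buffer, n_buf)             # rho * N_L levels from the buffer
        batch += generate_random_levels(N_L - n_buf)     # (1 - rho) * N_L random levels
        policy = rollout_and_ppo_update(policy, batch)   # collect trajectories, update phi
print(f"final toy policy parameter: {policy:.3f}")
```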
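
The Software Dependencies row notes that JAX lets computations run natively on accelerators, which is why JAX-based UED libraries only support JAX environments. The minimal example below shows the jit-plus-vmap pattern that enables this batching and compilation; the toy step function is an assumption for illustration and is not JaxNav or any environment from the paper.

```python
# Tiny illustration of jit-compiled, vectorised computation in JAX.
# The dynamics below are made up; only the jit/vmap pattern is the point.
import jax
import jax.numpy as jnp


def toy_env_step(state, action):
    # Arbitrary toy dynamics: drift the state toward the action.
    return state + 0.1 * (action - state)


# Vectorise over a batch of environments, then compile the whole batch once.
batched_step = jax.jit(jax.vmap(toy_env_step))

states = jnp.zeros((4096,))
actions = jnp.ones((4096,))
states = batched_step(states, actions)
print(states.shape, jax.devices()[0])
```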