No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery
Authors: Alexander Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that this simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard domain of Minigrid and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability. (An illustrative CVaR computation is sketched after the table.) |
| Researcher Affiliation | Academia | Alex Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster (University of Oxford) |
| Pseudocode | Yes | Algorithm 1 (Sampling For Learnability). Initialize: policy π_ϕ, level buffer D. While not converged: D ← collect_learnable_levels(π_ϕ) (using Alg. 2); for t = 1, …, T: D_t ← ρ·N_L levels sampled uniformly from D; D_t ← D_t ∪ (1−ρ)·N_L randomly generated levels; collect π's trajectories on D_t and update ϕ; end for; end while. (A Python sketch of this loop is given after the table.) |
| Open Source Code | Yes | We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability. |
| Open Datasets | Yes | We use four domains for our experiments, JaxNav in single-agent mode, JaxNav in multi-agent mode, the common UED domain Minigrid [13] and XLand-Minigrid [12]. See Appendix B for more details about the environments. |
| Dataset Splits | No | The paper describes "hand-designed test sets" and "randomly sampled set" for evaluation, but does not explicitly provide details about train/validation/test data splits in the conventional supervised learning sense, such as percentages or counts for a validation set. |
| Hardware Specification | Yes | Each individual seed was run on 1 NVIDIA L40s, using a server which has 8 NVIDIA L40s, two AMD EPYC 9554 processors (128 cores in total) and 768GB of RAM. These times are without logging, and we find that with logging, SFL is around 6% slower than ACCEL on single-agent JaxNav. For multi-agent JaxNav, we compare each method using the same number of PPO updates. The multi-agent results were run on a variety of machines, including the aforementioned L40s system, a similar system featuring NVIDIA A40s and a workstation containing 2 RTX 4090s. On a 4090, an SFL run takes 1d 1h 13m 54s while ACCEL takes 18h 17m 26s. |
| Software Dependencies | No | Recently, Bradbury et al. [18] released JAX, a Python numpy-like library that allows computations to run natively on accelerators (such as GPUs and TPUs). This has enabled researchers to run experiments that used to take weeks in a few hours [22, 23]. One side effect of this, however, is that current UED libraries are written in JAX, meaning they are primarily compatible with the (relatively small) set of JAX environments. |
| Experiment Setup | Yes | Table 4 contains the hyperparameters we use, with their selection process for each domain outlined below. We tuned PPO for DR for each domain and then used these same PPO parameters for all methods, tuning only UED-specific parameters. |
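The pseudocode row above summarises Algorithm 1 (Sampling For Learnability). Below is a minimal Python sketch of that outer loop; the callables `collect_learnable_levels`, `generate_random_levels`, and `ppo_update`, as well as the default values of `rho`, `num_levels`, `inner_steps`, and `outer_iters`, are hypothetical placeholders and are not taken from the authors' released JAX code.

```python
import random

def sfl_training_loop(policy, collect_learnable_levels, generate_random_levels,
                      ppo_update, rho=0.5, num_levels=32, inner_steps=64,
                      outer_iters=100):
    """Sketch of the outer loop of Algorithm 1 (Sampling For Learnability).

    Hypothetical callables standing in for routines in the authors' codebase:
      collect_learnable_levels(policy) -> list of high-learnability levels (Alg. 2)
      generate_random_levels(n)        -> list of n randomly generated levels
      ppo_update(policy, levels)       -> policy after PPO updates on rollouts from `levels`
    """
    for _ in range(outer_iters):                      # "while not converged do"
        buffer = collect_learnable_levels(policy)     # refresh the learnable-level buffer D
        for _ in range(inner_steps):                  # for t = 1, ..., T
            n_buf = min(int(rho * num_levels), len(buffer))
            batch = random.sample(buffer, k=n_buf)               # rho * N_L levels from D
            batch += generate_random_levels(num_levels - n_buf)  # (1 - rho) * N_L random levels
            policy = ppo_update(policy, batch)        # collect trajectories on D_t, update phi
    return policy
```

Mixing a ρ-fraction of each batch from the learnable-level buffer with freshly generated random levels mirrors the sampling step quoted in Algorithm 1.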
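The Research Type row also mentions an adversarial evaluation procedure that "closely mirrors the conditional value at risk (CVaR)". The snippet below only illustrates a plain CVaR over per-level returns, not the paper's adversarial procedure; the function name and the choice of `alpha` are assumptions.

```python
import numpy as np

def cvar_of_returns(returns, alpha=0.1):
    """Mean return over the worst alpha-fraction of evaluation levels (illustrative only)."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * sorted_returns.size)))
    return float(sorted_returns[:k].mean())

# Example: CVaR at alpha = 0.1 over 1000 per-level returns
print(cvar_of_returns(np.random.uniform(0.0, 1.0, size=1000), alpha=0.1))
```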