Refining Minimax Regret for Unsupervised Environment Design

Authors: Michael Beukman, Samuel Coward, Michael Matthews, Mattie Fellows, Minqi Jiang, Michael D Dennis, Jakob Nicolaus Foerster

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically demonstrate that the problems identified in Section 3 do occur, and that ReMiDi alleviates these issues. First, in Section 6.2, we illustrate some of the failure cases of ideal UED in a simple tabular setting. Next, in Section 6.3, we experiment in the canonical Minigrid domain. In Section 6.4, we consider a different setting where regret-based UED results in a policy that performs poorly over a large subset of levels. Finally, we evaluate on a robotics task in Section 6.5.
Researcher Affiliation | Academia | 1. University of Oxford, 2. University College London, 3. UC Berkeley.
Pseudocode | Yes | Algorithm 1: Refining Minimax Regret Distributions (an illustrative, generic sketch of a regret-based level-sampling loop is given after this table).
Open Source Code | Yes | We publicly release our code at https://github.com/Michael-Beukman/ReMiDi.
Open Datasets | Yes | We next consider Minigrid, a common benchmark in UED (Dennis et al., 2020; Jiang et al., 2021a; Parker-Holder et al., 2022). Our final experimental domain is robotics, using Brax (Todorov et al., 2012; Freeman et al., 2021). We evaluate the agent on a set of held-out standard test mazes used in prior work (Jiang et al., 2021a; Parker-Holder et al., 2022; Jiang et al., 2023). In particular, we use SixteenRooms, SixteenRooms2, Labyrinth, LabyrinthFlipped, Labyrinth2, StandardMaze, StandardMaze2, StandardMaze3, SmallCorridor and LargeCorridor.
Dataset Splits | No | The paper mentions 'a standard set of held-out mazes' for evaluation and running experiments with '10 seeds', but does not specify train/validation/test splits, their proportions, or how levels were partitioned between training and evaluation.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or the cloud instances used for running experiments.
Software Dependencies | No | The paper mentions software components such as 'JaxUED', 'Brax', 'PPO', and 'LSTM' but does not specify their version numbers or other ancillary software dependencies required for replication.
Experiment Setup | Yes | Appendix D.3 'Hyperparameter Tuning' and Table 9 'Hyperparameters' provide specific values for parameters such as 'PPO Number of Updates', 'γ', 'λ_GAE', 'PPO epochs', 'Adam learning rate', 'entropy coefficient', etc., for the Minigrid, Lever and Brax experiments.
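
The Pseudocode row above refers to the paper's Algorithm 1 (Refining Minimax Regret Distributions), which is not reproduced here. Purely for orientation, the sketch below shows a generic regret-prioritised level-sampling loop of the kind used in regret-based UED; it is not the paper's algorithm, and `generate_level`, `rollout`, `update_policy`, and the simple regret proxy are hypothetical placeholders.

```python
import random


def estimate_regret(policy_return: float, reference_return: float) -> float:
    # Hypothetical regret proxy: gap between an approximately-optimal return
    # on a level and the current agent's return on that level.
    return max(reference_return - policy_return, 0.0)


def regret_prioritised_loop(generate_level, rollout, update_policy,
                            num_iters=1000, buffer_size=256, replay_prob=0.5):
    """Generic regret-based UED loop (illustrative only, not the paper's Algorithm 1)."""
    buffer = []  # (level, estimated_regret) pairs, kept sorted by regret
    for _ in range(num_iters):
        if buffer and random.random() < replay_prob:
            # Replay a stored level with probability proportional to its estimated regret.
            levels, regrets = zip(*buffer)
            # Small epsilon keeps zero-regret levels sampleable and avoids a zero-weight error.
            level = random.choices(levels, weights=[r + 1e-6 for r in regrets])[0]
        else:
            # Otherwise propose a fresh level from a generator / adversary.
            level = generate_level()
        policy_return, reference_return = rollout(level)
        regret = estimate_regret(policy_return, reference_return)
        update_policy(level)  # e.g. a PPO update on trajectories collected from this level
        buffer.append((level, regret))
        buffer.sort(key=lambda entry: entry[1], reverse=True)
        buffer = buffer[:buffer_size]
    return buffer
```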
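
For the Experiment Setup row, the paper's actual hyperparameters are given in Appendix D.3 and Table 9. The snippet below only sketches, with assumed key names and placeholder values, how such per-domain PPO/UED settings might be organised; none of the numbers are taken from the paper.

```python
# Illustrative layout only: keys and values are placeholders, NOT the paper's settings.
# Consult Appendix D.3 and Table 9 of the paper for the actual hyperparameters.
example_configs = {
    "minigrid": {
        "ppo_num_updates": 30_000,    # placeholder
        "gamma": 0.995,               # discount factor γ (placeholder)
        "gae_lambda": 0.98,           # λ_GAE (placeholder)
        "ppo_epochs": 5,              # placeholder
        "adam_learning_rate": 1e-4,   # placeholder
        "entropy_coefficient": 1e-3,  # placeholder
    },
    "lever": {},  # placeholder: would mirror the structure above
    "brax": {},   # placeholder: would mirror the structure above
}
```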