Discovering modular solutions that generalize compositionally
Authors: Simon Schug, Seijin Kobayashi, Yassir Akram, Maciej Wołczyk, Alexandra Maria Proca, Johannes von Oswald, Razvan Pascanu, João Sacramento, Angelika Steger
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further demonstrate empirically that under the theoretically identified conditions, meta-learning from finite data can discover modular policies that generalize compositionally in a number of complex environments. |
| Researcher Affiliation | Collaboration | Simon Schug (ETH Zurich); Seijin Kobayashi (ETH Zurich); Yassir Akram (ETH Zurich); Maciej Wołczyk (IDEAS NCBR); Alexandra Proca (Imperial College London); Johannes von Oswald (ETH Zurich, Google Research); Razvan Pascanu (Google DeepMind); João Sacramento (ETH Zurich); Angelika Steger (ETH Zurich) |
| Pseudocode | Yes | Algorithm 1: Algorithm to sample the task latent variable given a mask. Algorithm 2: Bilevel training procedure. (A hedged sketch of such a bilevel loop follows the table.) |
| Open Source Code | Yes | Code available at https://github.com/smonsays/modular-hyperteacher |
| Open Datasets | No | The paper mentions generating data and using various task distributions but does not provide concrete access information (link, DOI, formal citation) for any specific publicly available or open dataset. |
| Dataset Splits | No | The paper describes training and query datasets without specifying explicit dataset splits (percentages, counts, or references to predefined splits for reproducibility). |
| Hardware Specification | Yes | We used Linux workstations with Nvidia RTX 2080 and Nvidia RTX 3090 GPUs for development and conducted hyperparameter searches and experiments using 5 TPUv2-8, 5 TPUv3-8 and 1 Linux server with 8 Nvidia RTX 3090 GPUs over the course of 9 months. In total, we spent an estimated amount of 6 GPU months. |
| Software Dependencies | Yes | We implemented our experiments in Python using JAX (Bradbury et al., 2018, Apache License 2.0) and the DeepMind JAX Ecosystem (Babuschkin et al., 2020, Apache License 2.0). For experiment tracking we used wandb (Biewald, 2020, MIT license) and for the generation of plots we used plotly (Plotly Technologies Inc., 2015, MIT license). |
| Experiment Setup | Yes | For all experiments in the multi-task teacher-student setting, we set B_outer = 64, B_inner = 256, N_outer = 60000, N_inner = 300. We optimize the inner loop using the Adam optimizer with η_inner = 0.003, and the outer loop using the AdamW optimizer with various weight decay strengths and an initial learning rate η_outer = 0.001, annealed with cosine annealing down to 10^-6 by the end of training. (A hedged optax sketch of this configuration follows the table.) |
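As a hypothetical illustration of what a bilevel training procedure like the paper's Algorithm 2 could look like, here is a minimal JAX/optax sketch: an inner loop adapts parameters on a task's support set, and an outer step differentiates the query-set loss through that adaptation. The toy linear model, loss, batch shapes, and shortened step count are placeholders, not the authors' actual learner.

```python
import jax
import jax.numpy as jnp
import optax

def loss_fn(params, batch):
    """Mean-squared error of a toy linear model (stand-in for the learner)."""
    x, y = batch
    return jnp.mean((x @ params["w"] - y) ** 2)

def inner_adapt(meta_params, support_batch, n_inner=5):
    """Inner loop: adapt the meta-parameters on a task's support set with Adam."""
    opt = optax.adam(learning_rate=3e-3)   # eta_inner = 0.003 per the setup row
    params, opt_state = meta_params, opt.init(meta_params)
    for _ in range(n_inner):               # shortened; the paper uses N_inner = 300
        grads = jax.grad(loss_fn)(params, support_batch)
        updates, opt_state = opt.update(grads, opt_state, params)
        params = optax.apply_updates(params, updates)
    return params

def outer_loss(meta_params, support_batch, query_batch):
    """Outer objective: query-set loss of the inner-adapted parameters."""
    return loss_fn(inner_adapt(meta_params, support_batch), query_batch)

# One outer update, differentiating through the whole inner loop.
meta_params = {"w": jnp.zeros((4, 1))}
outer_opt = optax.adamw(learning_rate=1e-3)
outer_state = outer_opt.init(meta_params)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 4))
y = jnp.ones((8, 1))
meta_grads = jax.grad(outer_loss)(meta_params, (x, y), (x, y))
updates, outer_state = outer_opt.update(meta_grads, outer_state, meta_params)
meta_params = optax.apply_updates(meta_params, updates)
```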
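The optimizer configuration quoted in the Experiment Setup row maps directly onto optax. The sketch below reproduces the stated hyperparameters under that assumption; the weight decay value is a placeholder, since the paper sweeps "various weight decay strengths", and the batch-size constants are recorded only as comments.

```python
import optax

# Hyperparameters quoted in the Experiment Setup row.
B_OUTER, B_INNER = 64, 256          # outer/inner batch sizes (for reference)
N_OUTER, N_INNER = 60_000, 300      # outer/inner step counts
ETA_INNER, ETA_OUTER = 0.003, 0.001

# Inner loop: plain Adam.
inner_optimizer = optax.adam(learning_rate=ETA_INNER)

# Outer loop: AdamW with cosine annealing from 1e-3 down to 1e-6.
outer_schedule = optax.cosine_decay_schedule(
    init_value=ETA_OUTER,
    decay_steps=N_OUTER,
    alpha=1e-6 / ETA_OUTER,  # schedule ends at init_value * alpha = 1e-6
)
outer_optimizer = optax.adamw(
    learning_rate=outer_schedule,
    weight_decay=1e-4,  # placeholder: the paper tries several strengths
)
```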