Discovering modular solutions that generalize compositionally

Authors: Simon Schug, Seijin Kobayashi, Yassir Akram, Maciej Wołczyk, Alexandra Maria Proca, Johannes von Oswald, Razvan Pascanu, João Sacramento, Angelika Steger

Venue: ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further demonstrate empirically that under the theoretically identified conditions, meta-learning from finite data can discover modular policies that generalize compositionally in a number of complex environments. |
| Researcher Affiliation | Collaboration | Simon Schug (ETH Zurich); Seijin Kobayashi (ETH Zurich); Yassir Akram (ETH Zurich); Maciej Wołczyk (IDEAS NCBR); Alexandra Proca (Imperial College London); Johannes von Oswald (ETH Zurich, Google Research); Razvan Pascanu (Google DeepMind); João Sacramento (ETH Zurich); Angelika Steger (ETH Zurich) |
| Pseudocode | Yes | Algorithm 1: sample the task latent variable given a mask. Algorithm 2: bilevel training procedure. (Hedged sketches of both appear below the table.) |
| Open Source Code | Yes | Code available at https://github.com/smonsays/modular-hyperteacher |
| Open Datasets | No | The paper mentions generating data and using various task distributions but does not provide concrete access information (link, DOI, or formal citation) for any specific publicly available or open dataset. |
| Dataset Splits | No | The paper describes training and query datasets without specifying explicit dataset splits (percentages, counts, or references to predefined splits for reproducibility). |
| Hardware Specification | Yes | We used Linux workstations with Nvidia RTX 2080 and Nvidia RTX 3090 GPUs for development and conducted hyperparameter searches and experiments using 5 TPUv2-8, 5 TPUv3-8 and 1 Linux server with 8 Nvidia RTX 3090 GPUs over the course of 9 months. In total, we spent an estimated amount of 6 GPU months. |
| Software Dependencies | Yes | We implemented our experiments in Python using JAX (Bradbury et al., 2018, Apache License 2.0) and the DeepMind JAX Ecosystem (Babuschkin et al., 2020, Apache License 2.0). For experiment tracking we used wandb (Biewald, 2020, MIT license) and for the generation of plots we used plotly (Plotly Technologies Inc., 2015, MIT license). |
| Experiment Setup | Yes | For all experiments in the multi-task teacher-student setting, we set B_outer = 64, B_inner = 256, N_outer = 60000, N_inner = 300. We optimize the inner loop using the Adam optimizer with η_inner = 0.003, and the outer loop using the AdamW optimizer with various weight decay strengths and an initial learning rate η_outer = 0.001, annealed using cosine annealing down to 10^-6 at the end of training. (See the configuration sketch below the table.) |
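
The paper's Algorithm 1 samples the task latent variable given a mask; the authors' exact procedure is in the linked repository. The snippet below is only a minimal JAX sketch: the binary mask over modules, the per-module latent layout, and the uniform sampling distribution are all illustrative assumptions, not the paper's specification.

```python
import jax
import jax.numpy as jnp

def sample_task_latent(key, mask, latent_dim):
    """Sketch of Algorithm 1: draw a task latent whose entries are
    nonzero only for the modules selected by the binary mask.

    The uniform distribution and the per-module latent layout are
    assumptions for illustration, not the paper's exact choices.
    """
    z = jax.random.uniform(key, (mask.shape[0], latent_dim))
    return z * mask[:, None]  # zero out latents of inactive modules

key = jax.random.PRNGKey(0)
mask = jnp.array([1.0, 0.0, 1.0, 0.0])  # e.g. modules 0 and 2 active
z = sample_task_latent(key, mask, latent_dim=3)
```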
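Algorithm 2 (the bilevel training procedure) together with the reported hyperparameters can be summarized as a MAML-style loop. The sketch below uses the quoted values (B_outer = 64, B_inner = 256, N_outer = 60000, N_inner = 300, η_inner = 0.003, η_outer = 0.001 cosine-annealed to 10^-6); the toy linear model, the squared-error loss, the weight decay value (1e-4, standing in for the "various strengths" the authors searched over), and differentiating straight through the inner loop are assumptions, not the authors' implementation.

```python
import jax
import jax.numpy as jnp
import optax

# Hyperparameters quoted in the experiment setup above.
B_OUTER, B_INNER = 64, 256
N_OUTER, N_INNER = 60_000, 300
ETA_INNER, ETA_OUTER = 3e-3, 1e-3

inner_opt = optax.adam(ETA_INNER)
# Cosine annealing from 1e-3 down to 1e-6 over training; the weight
# decay value is a placeholder for the "various strengths" searched.
schedule = optax.cosine_decay_schedule(
    init_value=ETA_OUTER, decay_steps=N_OUTER, alpha=1e-6 / ETA_OUTER)
outer_opt = optax.adamw(schedule, weight_decay=1e-4)

def loss_fn(params, x, y):
    # Toy stand-in for the student network: linear regression.
    return jnp.mean((x @ params - y) ** 2)

def inner_adapt(meta_params, x, y):
    """Inner loop: N_INNER Adam steps on the support set,
    starting from the meta-parameters (MAML-style initialization)."""
    def step(carry, _):
        params, opt_state = carry
        grads = jax.grad(loss_fn)(params, x, y)
        updates, opt_state = inner_opt.update(grads, opt_state, params)
        return (optax.apply_updates(params, updates), opt_state), None
    (params, _), _ = jax.lax.scan(
        step, (meta_params, inner_opt.init(meta_params)), None, length=N_INNER)
    return params

def outer_loss(meta_params, support, query):
    """Outer objective: query loss of the adapted parameters,
    differentiated through the inner loop."""
    adapted = inner_adapt(meta_params, *support)
    return loss_fn(adapted, *query)

# One outer update; in training this is repeated N_OUTER times,
# averaging over a batch of B_OUTER tasks.
key = jax.random.PRNGKey(0)
meta_params = jnp.zeros(8)
x = jax.random.normal(key, (B_INNER, 8))
support = (x, x @ jnp.ones(8))
query = support  # reuse the support set for this demo
outer_state = outer_opt.init(meta_params)
grads = jax.grad(outer_loss)(meta_params, support, query)
updates, outer_state = outer_opt.update(grads, outer_state, meta_params)
meta_params = optax.apply_updates(meta_params, updates)
```

Note that the sketch takes exact gradients through all N_inner Adam steps via `jax.lax.scan`; whether the authors use full second-order gradients or a first-order approximation is not stated in the quoted setup.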