Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks

Authors: Róbert Csordás, Sjoerd van Steenkiste, Jürgen Schmidhuber

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using this powerful tool, we contribute an extensive study of emerging modularity in NNs that covers several standard architectures and datasets. We demonstrate how common NNs fail to reuse submodules and offer new insights into the related issue of systematic generalization on language tasks.
Researcher Affiliation | Collaboration | Róbert Csordás (IDSIA / USI / SUPSI, robert@idsia.ch); Sjoerd van Steenkiste (IDSIA / USI / SUPSI, sjoerd@idsia.ch); Jürgen Schmidhuber (IDSIA / USI / SUPSI / NNAISENSE, juergen@idsia.ch)
Pseudocode | No | The paper describes its method mathematically but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code for all experiments is available at https://github.com/RobertCsordas/modules.
Open Datasets | Yes | SCAN dataset (Lake & Baroni, 2018), Mathematics Dataset (Saxton et al., 2019), CIFAR10 (Krizhevsky et al., 2009), permuted MNIST benchmark (Kirkpatrick et al., 2017; Golkar et al., 2019; Kolouri et al., 2019)
Dataset Splits | Yes | We randomly choose 10k samples for the new validation set; the rest is used as the new train set.
Hardware Specification | No | The paper mentions 'hardware donations from NVIDIA & IBM' and that experiments 'fit on a single GPU with 16Gb of VRAM (2 GPUs for Poly. collect)', but does not specify exact GPU models (e.g., RTX 3090, A100), CPU models, or other detailed hardware components.
Software Dependencies | No | The paper states 'Our method is implemented in PyTorch (Paszke et al., 2019)' but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | Unless otherwise noted we use the Adam optimizer (Kingma & Ba, 2015), a batch size of 128, a learning rate of 10⁻³, and gradient clipping of 1. The FNN is 5 layers deep, each layer having 2000 units and the LSTM a hidden state size of 256... Mask training uses a learning rate of 10⁻² and β = 10⁻⁴ for regularization. Table 4: Hyperparameters for different tasks on the Mathematics Dataset.
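
The quoted experiment setup maps directly onto a training configuration. Below is a minimal, hypothetical PyTorch sketch of that reported setup (Adam, batch size 128, learning rate 10⁻³, gradient clipping of 1, a 5-layer FNN with 2000 units per layer); the model wiring, input/output sizes, and loss function are placeholder assumptions, the mask-training hyperparameters (learning rate 10⁻², β = 10⁻⁴) appear only as comments, and this is not the authors' released code.

```python
# Hypothetical sketch of the reported base training setup; not the authors' code.
# Reported values: Adam optimizer, batch size 128, learning rate 1e-3, gradient
# clipping of 1, a 5-layer FNN with 2000 units per layer.
import torch
import torch.nn as nn

# Placeholder FNN; input/output sizes are assumptions for illustration only.
model = nn.Sequential(
    nn.Linear(784, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(batch_x: torch.Tensor, batch_y: torch.Tensor) -> float:
    """One optimization step with the reported gradient clipping of 1."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch_x), batch_y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()

# Mask training (not shown): the paper reports a learning rate of 1e-2 for the
# differentiable weight masks and a regularization term weighted by beta = 1e-4,
# applied while the underlying network weights are kept fixed.
```

In use, batches from a DataLoader with batch_size=128 would be fed to train_step; the LSTM variant (hidden state size 256) and the per-task hyperparameters in the paper's Table 4 would replace the placeholder model where applicable.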