Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks
Authors: Róbert Csordás, Sjoerd van Steenkiste, Jürgen Schmidhuber
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using this powerful tool, we contribute an extensive study of emerging modularity in NNs that covers several standard architectures and datasets. We demonstrate how common NNs fail to reuse submodules and offer new insights into the related issue of systematic generalization on language tasks. |
| Researcher Affiliation | Collaboration | Róbert Csordás (IDSIA / USI / SUPSI, robert@idsia.ch); Sjoerd van Steenkiste (IDSIA / USI / SUPSI, sjoerd@idsia.ch); Jürgen Schmidhuber (IDSIA / USI / SUPSI / NNAISENSE, juergen@idsia.ch) |
| Pseudocode | No | The paper describes its method mathematically but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for all experiments is available at https://github.com/RobertCsordas/modules. |
| Open Datasets | Yes | SCAN dataset (Lake & Baroni, 2018), Mathematics Dataset (Saxton et al., 2019), CIFAR10 (Krizhevsky et al., 2009), permuted MNIST benchmark (Kirkpatrick et al., 2017; Golkar et al., 2019; Kolouri et al., 2019) |
| Dataset Splits | Yes | We randomly choose 10k samples for the new validation set; the rest is used as the new train set. |
| Hardware Specification | No | The paper mentions 'hardware donations from NVIDIA & IBM' and that experiments 'fit on a single GPU with 16Gb of VRAM (2 GPUs for Poly. collect)', but does not specify exact GPU models (e.g., RTX 3090, A100), CPU models, or other detailed hardware components. |
| Software Dependencies | No | The paper states 'Our method is implemented in PyTorch (Paszke et al., 2019)' but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Unless otherwise noted we use the Adam optimizer (Kingma & Ba, 2015), a batch size of 128, a learning rate of 10^-3, and gradient clipping of 1. The FNN is 5 layers deep, each layer having 2000 units and the LSTM a hidden state size of 256... Mask training uses a learning rate of 10^-2 and β = 10^-4 for regularization. Table 4: Hyperparameters for different tasks on the Mathematics Dataset (see the mask-training sketch below this table) |
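The mask-training hyperparameters in the last row refer to the paper's central technique: learning differentiable binary masks over the weights of an already-trained, frozen network to identify which weights a given subtask actually uses. Below is a minimal PyTorch sketch of that idea, assuming a straight-through Gumbel-sigmoid parameterization of per-weight masks and a simple FNN probe. The architecture, function names, and noise parameterization are illustrative assumptions, not the authors' implementation (which is available at the repository above).

```python
import torch
import torch.nn.functional as F

def sample_binary_mask(logits, tau=1.0):
    # Straight-through Gumbel-sigmoid: binary values in the forward pass,
    # gradients flow through the relaxed (soft) sample in the backward pass.
    u = torch.rand_like(logits)
    noise = torch.log(u) - torch.log1p(-u)        # logistic noise
    soft = torch.sigmoid((logits + noise) / tau)  # relaxed Bernoulli sample
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()

def masked_forward(layers, mask_logits, x):
    # Run the frozen FNN with each weight matrix element-wise gated by its mask.
    h = x
    for i, (layer, logits) in enumerate(zip(layers, mask_logits)):
        m = sample_binary_mask(logits)
        h = F.linear(h, layer.weight * m, layer.bias)
        if i < len(layers) - 1:
            h = F.relu(h)
    return h

# Frozen, already-trained probe network (the paper's FNN is 5 layers of 2000 units).
layers = torch.nn.ModuleList(
    [torch.nn.Linear(784, 2000)]
    + [torch.nn.Linear(2000, 2000) for _ in range(3)]
    + [torch.nn.Linear(2000, 10)]
)
for p in layers.parameters():
    p.requires_grad_(False)

# One trainable logit per weight; only these are updated during mask training.
mask_logits = [torch.nn.Parameter(torch.ones_like(l.weight)) for l in layers]
opt = torch.optim.Adam(mask_logits, lr=1e-2)  # mask learning rate reported in the paper
beta = 1e-4                                   # regularization strength reported in the paper

def mask_step(x, y):
    opt.zero_grad()
    out = masked_forward(layers, mask_logits, x)
    # The regularizer pushes mask probabilities toward zero, so only weights that
    # are necessary for the probed subtask remain enabled.
    reg = sum(torch.sigmoid(l).sum() for l in mask_logits)
    loss = F.cross_entropy(out, y) + beta * reg
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch, thresholding the learned probabilities after training (e.g., keeping weights with sigmoid(logit) > 0.5) yields a binary subnetwork that can be inspected or compared across subtasks to probe weight sharing and reuse.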