Maslow’s Hammer in Catastrophic Forgetting: Node Re-Use vs. Node Activation

Authors: Sebastian Lee, Stefano Sarao Mannelli, Claudia Clopath, Sebastian Goldt, Andrew Saxe

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name the Maslow's hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in the worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
Researcher Affiliation | Academia | Imperial College London, UK; Sainsbury Wellcome Centre, UCL; Gatsby Computational Neuroscience Unit, UCL; International School of Advanced Studies (SISSA), Trieste, Italy; CIFAR Azrieli Global Scholars program, CIFAR, Toronto, Canada.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Code: github.com/seblee97/student_teacher_catastrophic
Open Datasets | Yes | For the data mixing framework, we use the Fashion-MNIST dataset, with the following parameters: D1: class 0, class 5; D2: class 2, class 7 (a loading sketch is given after the table).
Dataset Splits | No | The paper mentions an 'early stopping regime such that the weights used for the network in the second task are those with the lowest test error obtained during the first phase of training' (sketched after the table). This implies a validation step, but it does not specify explicit dataset splits (e.g., percentages, counts) for training, validation, and testing.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software components like 'SGD optimiser' and 'Mean squared error loss', but it does not specify any version numbers for these or any other software libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | Unless mentioned otherwise in the main text, the following parameters were used in all teacher-student runs: Input dimension = 1000; Test set size = 50,000; SGD optimiser; Mean squared error loss; Teacher weight initialisation: normal distribution with variance 1; Student weight initialisation: normal distribution with variance 0.001; Student hidden dimension = 4; Teacher hidden dimension = 2; Learning rate = 0.1; Nonlinearity = scaled error function. For the data mixing framework, we use the Fashion-MNIST dataset, with the following parameters: D1: class 0, class 5; D2: class 2, class 7; SGD optimiser; Mean squared error loss; Batch size = 1; Input dimension = 1024; Hidden dimension = 8; Nonlinearity = sigmoid; Learning rate = 0.001. (Both configurations are sketched below.)
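
The teacher-student configuration in the Experiment Setup row can be expressed as a short script. Below is a minimal sketch, assuming PyTorch and a generic two-layer network class; the names TwoLayerNet and scaled_erf and the 1/sqrt(input dimension) pre-activation scaling are illustrative assumptions rather than details taken from the paper or its repository, while the dimensions, initialisation variances, optimiser, loss and learning rate follow the table above.

```python
import math

import torch
import torch.nn as nn


def scaled_erf(x: torch.Tensor) -> torch.Tensor:
    # Scaled error function nonlinearity: g(x) = erf(x / sqrt(2)).
    return torch.erf(x / math.sqrt(2.0))


class TwoLayerNet(nn.Module):
    # Two-layer network used here for both teachers and the student (illustrative).
    def __init__(self, input_dim: int, hidden_dim: int, init_variance: float):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(input_dim, hidden_dim) * math.sqrt(init_variance))
        self.w2 = nn.Parameter(torch.randn(hidden_dim, 1) * math.sqrt(init_variance))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed 1/sqrt(input_dim) pre-activation scaling (standard in teacher-student
        # analyses, but not stated in the table above).
        return scaled_erf(x @ self.w1 / math.sqrt(x.shape[1])) @ self.w2


input_dim = 1000                       # input dimension from the table
teachers = [TwoLayerNet(input_dim, hidden_dim=2, init_variance=1.0) for _ in range(2)]
student = TwoLayerNet(input_dim, hidden_dim=4, init_variance=0.001)

optimiser = torch.optim.SGD(student.parameters(), lr=0.1)   # SGD, learning rate 0.1
loss_fn = nn.MSELoss()                                      # mean squared error loss
```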
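
The early-stopping regime quoted in the Dataset Splits row can be illustrated by continuing the sketch above: train the student on the first teacher with online Gaussian samples, track the test error on a held-out test set of the reported size, and start the second task from the lowest-error weights. The training horizon, evaluation interval and online sampling scheme are assumptions; only the test-set size (50,000) comes from the table.

```python
import copy

# Held-out test set for the first task (test set size from the table).
test_x = torch.randn(50_000, input_dim)
with torch.no_grad():
    test_y = teachers[0](test_x)

best_error, best_state = float("inf"), None
for step in range(10_000):                       # assumed training horizon
    x = torch.randn(1, input_dim)                # assumed online sampling: fresh input each step
    with torch.no_grad():
        y = teachers[0](x)

    optimiser.zero_grad()
    loss = loss_fn(student(x), y)
    loss.backward()
    optimiser.step()

    if step % 100 == 0:                          # assumed evaluation interval
        with torch.no_grad():
            test_error = loss_fn(student(test_x), test_y).item()
        if test_error < best_error:
            best_error = test_error
            best_state = copy.deepcopy(student.state_dict())

# The weights carried into the second task (second teacher) are those with the
# lowest test error observed during the first phase of training.
student.load_state_dict(best_state)
```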
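
For the Fashion-MNIST data-mixing setup, a separate, self-contained sketch is given below, again assuming PyTorch (plus torchvision). The class pairs (D1: 0 and 5; D2: 2 and 7), hidden width, sigmoid nonlinearity, mean squared error loss, batch size and learning rate follow the table; padding the 28x28 images to 32x32 to obtain the stated 1024-dimensional input, and the helper class_subset, are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Assumed preprocessing: pad 28x28 images to 32x32 so the flattened input is 1024-dimensional.
transform = transforms.Compose([
    transforms.Pad(2),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1)),
])

full_train = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)


def class_subset(dataset, classes):
    # Keep only the examples whose label is in `classes`.
    mask = (dataset.targets.unsqueeze(1) == torch.tensor(classes)).any(dim=1)
    return Subset(dataset, mask.nonzero(as_tuple=True)[0].tolist())


d1 = class_subset(full_train, [0, 5])   # first task: classes 0 and 5
d2 = class_subset(full_train, [2, 7])   # second task: classes 2 and 7

# Network and optimisation settings from the table: 1024 -> 8 -> 1 with sigmoid,
# SGD with learning rate 0.001, mean squared error loss, batch size 1.
model = nn.Sequential(nn.Linear(1024, 8), nn.Sigmoid(), nn.Linear(8, 1))
optimiser = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()
loader_d1 = DataLoader(d1, batch_size=1, shuffle=True)
```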