Maslow’s Hammer in Catastrophic Forgetting: Node Re-Use vs. Node Activation
Authors: Sebastian Lee, Stefano Sarao Mannelli, Claudia Clopath, Sebastian Goldt, Andrew Saxe
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon, which we name the Maslow's hammer hypothesis. Our analysis reveals a trade-off between node activation and node re-use that results in the worst forgetting in the intermediate regime. Using this understanding, we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off and identify the regimes in which they are most effective. |
| Researcher Affiliation | Academia | (1) Imperial College London, UK; (2) Sainsbury Wellcome Centre, UCL; (3) Gatsby Computational Neuroscience Unit, UCL; (4) International School for Advanced Studies (SISSA), Trieste, Italy; (5) CIFAR Azrieli Global Scholars program, CIFAR, Toronto, Canada. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: github.com/seblee97/student_teacher_catastrophic |
| Open Datasets | Yes | For the data mixing framework, we use the Fashion-MNIST (Modified National Institute of Standards and Technology) dataset, with the following parameters: D1: class 0, class 5; D2: class 2, class 7. |
| Dataset Splits | No | The paper mentions an 'early stopping regime such that the weights used for the network in the second task are those with the lowest test error obtained during the first phase of training.' This implies a validation step, but it does not specify explicit dataset splits (e.g., percentages, counts) for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like 'SGD optimiser' and 'Mean squared error loss', but it does not specify any version numbers for these or any other software libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | Unless mentioned otherwise in the main text, the following parameters were used in all teacher-student runs: input dimension = 1000; test set size = 50,000; SGD optimiser; mean squared error loss; teacher weight initialisation: normal distribution with variance 1; student weight initialisation: normal distribution with variance 0.001; student hidden dimension = 4; teacher hidden dimension = 2; learning rate = 0.1; nonlinearity = scaled error function. For the data mixing framework, we use the Fashion-MNIST (Modified National Institute of Standards and Technology) dataset, with the following parameters: D1: class 0, class 5; D2: class 2, class 7; SGD optimiser; mean squared error loss; batch size = 1; input dimension = 1024; hidden dimension = 8; nonlinearity = sigmoid; learning rate = 0.001. |