Continual Learning in the Teacher-Student Setup: Impact of Task Similarity

Authors: Sebastian Lee, Sebastian Goldt, Andrew Saxe

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyse continual learning in two-layer neural networks by deriving a closed set of equations that predict the test error of a network trained on a succession of tasks using one-pass (or online) SGD, extending prior analyses of the single-teacher setup to sequential learning of two tasks. Using these equations, we show that intermediate task similarity leads to the greatest forgetting in our model. We disentangle task similarity at the level of features (input-to-hidden weights) and readouts (hidden-to-output weights) and describe the effect of both types of similarity on forgetting and transfer in infinitely wide networks. We find that feature and readout similarity contribute in complex and sometimes non-symmetric ways to a range of forgetting and transfer metrics. We summarise our approach in Fig. 1. In the classical teacher-student setup (illustrated in Fig. 1a), a student neural network is trained on synthetic data where inputs x ∈ R^D are drawn element-wise i.i.d. from the normal distribution and labels are generated by a teacher network (Gardner & Derrida, 1989). To model continual learning, we consider a setup with two teachers, corresponding to two tasks to be learned in succession (a sketch of this data-generating process appears after the table). We plot the theoretical prediction in Fig. 1c together with a single simulation (crosses); even at moderate input size D = 10^4, the agreement is good.
Researcher Affiliation | Collaboration | 1 Imperial College London, UK; 2 International School of Advanced Studies (SISSA), Trieste, Italy; 3 Department of Experimental Psychology, University of Oxford, UK; 4 CIFAR Azrieli Global Scholars program, CIFAR, Toronto, Canada; 5 Facebook AI Research.
Pseudocode | No | The paper provides mathematical equations describing the model and its dynamics, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for all experiments and ODE simulations can be found at https://github.com/seblee97/student_teacher_catastrophic
Open Datasets | No | The paper states: "a student neural network is trained on synthetic data where inputs x ∈ R^D are drawn element-wise i.i.d. from the normal distribution and labels are generated by a teacher network". This indicates that the data is generated synthetically according to the model setup, not derived from a publicly available dataset. While CIFAR-10 and CIFAR-100 are mentioned, this is in the context of reproducing empirical findings from *other* works, not the paper's own experiments.
Dataset Splits | No | The paper clarifies the training setup for online SGD: "Note in the online SGD setting, there is no distinction between train and test error." Explicit train/validation/test splits are therefore neither applicable nor provided in the traditional sense for this online learning setup (see the generalization-error sketch after the table).
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments. It mentions the system size D = 10^4 for simulations but not hardware specifications.
Software Dependencies | No | The paper does not specify particular software dependencies with version numbers, such as programming languages or libraries, that would be needed for reproducibility.
Experiment Setup | Yes | The paper provides details on the experimental setup, including the use of online stochastic gradient descent, L2 loss, learning rates (α_W, α_h), initial weight distribution ("i.i.d. from the normal distribution with standard deviation σ0"), and specific model parameters for simulations (e.g., "K = 2 neurons", "M = 1 neuron each", "N = 10000, M = 1, K = 2", "N = 15, M = 1000, K = 250"). A training-loop sketch based on these details appears after the table.
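
A minimal sketch of the two-teacher data-generating process described in the Research Type row: i.i.d. standard-normal inputs labelled by two randomly drawn two-layer teacher networks. The erf activation, the 1/sqrt(D) pre-activation scaling, and all variable names are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np
from scipy.special import erf

D, M = 10_000, 1                      # input dimension and teacher width quoted above
rng = np.random.default_rng(0)

def make_teacher():
    """Draw a random two-layer teacher with M hidden units."""
    return rng.standard_normal((M, D)), rng.standard_normal(M)

def teacher_label(W, v, x):
    # Pre-activations are divided by sqrt(D) so they stay O(1) when the
    # input components are i.i.d. N(0, 1).
    return v @ erf(W @ x / np.sqrt(D))

# Two independent teachers correspond to the two tasks learned in succession.
(W1, v1), (W2, v2) = make_teacher(), make_teacher()

x = rng.standard_normal(D)            # fresh input, drawn element-wise i.i.d. N(0, 1)
y_task1 = teacher_label(W1, v1, x)
y_task2 = teacher_label(W2, v2, x)
```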
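
Because training uses an endless stream of fresh samples, the test (generalization) error referenced in the Dataset Splits row can be estimated directly by Monte Carlo over newly drawn Gaussian inputs; no held-out split is needed. A minimal sketch under the same assumed architecture as above (erf activation, 1/sqrt(D) scaling); the 0.5 prefactor and sample count are arbitrary choices.

```python
import numpy as np
from scipy.special import erf

def generalization_error(W_s, h_s, W_t, v_t, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E_x[0.5 * (student(x) - teacher(x))^2]
    with x ~ N(0, I). Inputs are generated on the fly, so there is no
    train/test split to manage."""
    rng = np.random.default_rng(seed)
    D = W_s.shape[1]
    X = rng.standard_normal((n_samples, D))
    student_out = erf(X @ W_s.T / np.sqrt(D)) @ h_s
    teacher_out = erf(X @ W_t.T / np.sqrt(D)) @ v_t
    return 0.5 * np.mean((student_out - teacher_out) ** 2)
```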
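
Finally, a minimal sketch of the training loop implied by the Experiment Setup row: one-pass (online) SGD on the L2 loss with separate learning rates for the input-to-hidden weights and the readout, and a student initialised i.i.d. normal with standard deviation sigma0. The concrete learning-rate values, sigma0, the step count, and the omission of the 1/D learning-rate scaling used in the paper's ODE analysis are all simplifying assumptions.

```python
import numpy as np
from scipy.special import erf

D, K, M = 10_000, 2, 1        # input dim, student width, teacher width (from the table)
lr_W, lr_h = 0.5, 0.5         # learning rates alpha_W and alpha_h; values illustrative
sigma0 = 0.001                # std of the initial student weights; value assumed
rng = np.random.default_rng(1)

# Teacher defining the current task.
W_t, v_t = rng.standard_normal((M, D)), rng.standard_normal(M)

# Student initialised i.i.d. normal with standard deviation sigma0.
W = sigma0 * rng.standard_normal((K, D))
h = sigma0 * rng.standard_normal(K)

for step in range(50_000):
    x = rng.standard_normal(D)                   # one-pass SGD: every sample is fresh
    y = v_t @ erf(W_t @ x / np.sqrt(D))          # teacher label

    pre = W @ x / np.sqrt(D)                     # student pre-activations, shape (K,)
    act = erf(pre)
    err = h @ act - y                            # residual of the L2 loss 0.5 * err**2

    # Exact gradients of the L2 loss; d/dz erf(z) = 2/sqrt(pi) * exp(-z^2).
    grad_h = err * act
    grad_W = np.outer(err * h * (2 / np.sqrt(np.pi)) * np.exp(-pre ** 2),
                      x / np.sqrt(D))

    h -= lr_h * grad_h
    W -= lr_W * grad_W
```

Continual learning can then be modelled by swapping (W_t, v_t) for a second teacher partway through the stream and continuing the same loop, tracking the error on both tasks with a routine like the generalization-error sketch above.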