Critical Learning Periods Emerge Even in Deep Linear Networks
Authors: Michael Kleinman, Alessandro Achille, Stefano Soatto
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show analytically and in simulations that the learning of features is tied to competition between sources. Finally, we extend our analysis to multi-task learning to show that pre-training on certain tasks can damage the transfer performance on new tasks, and show how this depends on the relationship between tasks and the duration of the pre-training stage. To the best of our knowledge, our work provides the first analytically tractable model that sheds light into why critical learning periods emerge in biological and artificial networks. |
| Researcher Affiliation | Academia | Michael Kleinman (Stanford University), Alessandro Achille (Caltech), Stefano Soatto (UCLA) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at: https://github.com/mjkleinman/CriticalPeriodDeepLinearNets |
| Open Datasets | No | The paper describes synthetic data generation, for example, for the matrix completion problem: 'We constructed an N × N matrix M of particular rank R by creating two R × N matrices L1 and L2, each with entries sampled from a zero-mean Gaussian distribution with standard deviation 1, and then taking M = L1^T L2.' For the multipathway model: 'Each input is 8-dimensional and the output is 15-dimensional, with the input encoded as a one-hot vector, and the output corresponding to the columns of Σ_yx (rows of Σ_yx^T).' It does not provide concrete access information such as links or citations for a publicly available dataset. A data-generation sketch for the matrix completion setup follows the table. |
| Dataset Splits | No | The paper mentions 'training set' but does not specify dataset splits (e.g., percentages, counts, or predefined splits) for training, validation, and test sets. It primarily focuses on analytical derivations and simulations of learning dynamics. |
| Hardware Specification | Yes | All experiments in the paper can be reproduced on a local computer in around 7 hours. We used a 2017 Macbook Pro (3.1 GHz Quad-Core Intel Core i7). |
| Software Dependencies | No | The paper implicitly mentions using PyTorch ('.detach() method in PyTorch'), but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | We trained a depth Da = Db = 4 network, with 100 units per layer of each pathway, using SGD with a constant learning rate of 0.01 and the squared error loss (Eq. 4). We initialize each pathway independently with p(0) ∼ N(0, 0.01²) and q²(0) − p²(0) = 1. For the matrix completion experiments, we trained with a constant learning rate of 0.2 using batch gradient descent, minimizing the loss in Eq. 9 averaged over observed entries. We initialized components by setting the standard deviation of each parameter in the deep matrix factorization to σ = (1/N) · g^(1/D), where g sets the initial scale and N is the number of columns (or rows) of the square matrices Wi. We set the number of observed entries to 2000. We trained networks for a variable duration on the first task, and then trained the network for a fixed number of epochs on the final task (30000 additional epochs). A PyTorch sketch of the matrix factorization training follows the table. |
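
The quoted data-generation description (a rank-R target M = L1^T L2 built from Gaussian factors, with a subset of entries revealed for matrix completion) can be sketched as below. This is a minimal NumPy reconstruction, not the authors' released code; the function names and the sizes N = 100, R = 5 are illustrative assumptions, while the Gaussian construction and the 2000 observed entries come from the quoted text.

```python
import numpy as np

def make_low_rank_target(N: int, R: int, seed: int = 0) -> np.ndarray:
    """Build an N x N rank-R matrix M = L1^T @ L2 with i.i.d. N(0, 1) entries in L1, L2."""
    rng = np.random.default_rng(seed)
    L1 = rng.normal(loc=0.0, scale=1.0, size=(R, N))
    L2 = rng.normal(loc=0.0, scale=1.0, size=(R, N))
    return L1.T @ L2  # N x N, rank at most R

def sample_observed_entries(N: int, num_observed: int, seed: int = 0) -> np.ndarray:
    """Uniformly sample (row, col) indices of the entries revealed for matrix completion."""
    rng = np.random.default_rng(seed)
    flat = rng.choice(N * N, size=num_observed, replace=False)
    return np.stack(np.unravel_index(flat, (N, N)), axis=1)  # shape (num_observed, 2)

M = make_low_rank_target(N=100, R=5)                          # illustrative sizes
obs_idx = sample_observed_entries(N=100, num_observed=2000)   # 2000 observed entries, as in the setup
```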
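
The matrix-completion training described in the experiment-setup row (full-batch gradient descent at learning rate 0.2 on the squared error over observed entries, with each factor initialized at standard deviation σ = (1/N) · g^(1/D)) can be sketched in PyTorch as follows. This is a reconstruction under stated assumptions, not the authors' released code: `train_deep_factorization`, the depth, g, the matrix size, and the shortened epoch count are illustrative choices; only the learning rate, the initialization rule, and the 2000 observed entries come from the quoted setup.

```python
import torch

def train_deep_factorization(M: torch.Tensor, obs_idx: torch.Tensor,
                             depth: int = 4, g: float = 1e-3,
                             lr: float = 0.2, epochs: int = 1000):
    """Fit a product of `depth` square factors to the observed entries of M."""
    N = M.shape[0]
    sigma = (1.0 / N) * g ** (1.0 / depth)  # init std: sigma = (1/N) * g^(1/D)
    factors = [torch.nn.Parameter(sigma * torch.randn(N, N)) for _ in range(depth)]
    opt = torch.optim.SGD(factors, lr=lr)
    rows, cols = obs_idx[:, 0], obs_idx[:, 1]
    targets = M[rows, cols]
    for _ in range(epochs):
        W = factors[0]
        for f in factors[1:]:
            W = f @ W  # end-to-end linear map W_D ... W_1
        loss = ((W[rows, cols] - targets) ** 2).mean()  # squared error on observed entries only
        opt.zero_grad()
        loss.backward()
        opt.step()
    return factors

# Example usage with a synthetic rank-5 target and 2000 observed entries:
N, R = 100, 5
M = torch.randn(R, N).T @ torch.randn(R, N)
flat = torch.randperm(N * N)[:2000]
obs_idx = torch.stack((flat // N, flat % N), dim=1)
factors = train_deep_factorization(M, obs_idx)
```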