Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Importance of Being Lazy: Scaling Limits of Continual Learning
Authors: Jacopo Graldi, Alessandro Breccia, Giulia Lanzillotta, Thomas Hofmann, Lorenzo Noci
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we perform a systematic study on the impact of model scale and the degree of feature learning in continual learning. We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. Using the framework of dynamical mean field theory, we then study the infinite width dynamics of the model in the feature learning regime and characterize CF, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks. We identify a transition modulated by task similarity where the model exits an effectively lazy regime with low forgetting to enter a rich regime with significant forgetting. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and transfers across model scales. This work provides a unified perspective on the role of scale and feature learning in continual learning. In Fig. 2 we show the CF for varying width and parameterization. |
| Researcher Affiliation | Academia | 1Dept. of Information Technology and Electrical Engineering, ETH Zurich, Switzerland 2Dept. of Physics and Astronomy, University of Padua, Italy 3Dept. of Computer Science, ETH Zurich 4ETH AI Center. |
| Pseudocode | No | The paper describes the model architecture and experimental setup verbally and mathematically. It does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | In most of our experiments, we utilize a ResNet architecture (with base width N = 64 and depth L = 6), trained with Stochastic Gradient Descent (SGD) and evaluated on MNIST, CIFAR10, and Tiny ImageNet and their suitable adaptations to the continual learning setting. |
| Dataset Splits | Yes | The Split-CIFAR10 (Zenke et al., 2017) (TIL type of benchmark) dataset has 5 tasks of 2 classes each (i.e. the 10 classes of CIFAR10 are split into 5 tasks with non-overlapping classes), and as common for TIL benchmarks, the model uses a separate head for each task. ... Permuted MNIST... each task of this benchmark consists of the MNIST dataset with a random but fixed permutation of the pixels. ... Split-Tiny ImageNet... we consider varying the number of tasks and classes-per-task: 5 tasks of 2 classes each (we denote it 5/2), 5/10, 5/40, as well as 20 tasks of 2 classes each (20/2), and 20/10. For the experiments involving the infinite-width simulations of the network dynamics, instead, we will use a simpler non-linear two-layer perceptron (MLP henceforth), on a small subset of MNIST with 30 samples, suitably modified as a 2-task CL benchmark. |
| Hardware Specification | Yes | All training runs and experiments are executed either on a single NVIDIA GeForce RTX 4090 or on an NVIDIA RTX A6000. |
| Software Dependencies | No | The paper mentions PyTorch and the PyHessian library but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In all experiments, if not otherwise specified, we use Stochastic Gradient Descent (SGD) optimization without momentum or weight decay. To find the learning rate η0(0), we do a hyperparameter search on the model at base width and depth (i.e. where all parameterizations are equivalent) on the full (i.e. non-CL) dataset taking the optimal test accuracy as a metric. We use a cosine learning rate schedule without a warmup, restarting the learning rate at the beginning of each task. For all datasets, we use a batch size of 128. The Split-CIFAR10 ... We train for 5 epochs on each task with a learning rate of η0(0) = 30.0. ... Permuted MNIST ... Each task is trained for 5 epochs with η0(0) = 2.0. ... Infinite-Width Experiments on 2-layer non-linear MLP ... Each task is optimized for 1000 epochs and a fixed LR of η0 = 0.25. To avoid the explosion of the initialization output at low values of γ0, we initialize the last layer of the MLP to 0 to have a well-defined output of 0 at t = 0. ... Split-Tiny ImageNet ... We optimize each task for 10 epochs with η0(0) = 15.0. |
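The Permuted MNIST protocol quoted in the Dataset Splits row (each task is the MNIST dataset under a random but fixed pixel permutation) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `make_permuted_tasks`, the use of NumPy, and the toy data are all assumptions.

```python
import numpy as np

def make_permuted_tasks(images, num_tasks, seed=0):
    """Build Permuted MNIST-style tasks: each task applies one fixed,
    random pixel permutation to every flattened image.
    (Illustrative sketch; not the paper's implementation.)"""
    rng = np.random.default_rng(seed)
    n_pixels = images.shape[1]
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(n_pixels)  # fixed for the whole task
        tasks.append(images[:, perm])
    return tasks

# Toy stand-in for flattened MNIST (28 * 28 = 784 pixels per image).
toy = np.arange(2 * 784, dtype=np.float32).reshape(2, 784)
tasks = make_permuted_tasks(toy, num_tasks=3, seed=42)
assert len(tasks) == 3 and tasks[0].shape == (2, 784)
# A permutation only reorders pixels, so each image's multiset of
# pixel values is preserved within every task.
assert np.allclose(np.sort(tasks[1][0]), np.sort(toy[0]))
```

Because the permutation is drawn once per task, all images within a task share the same reordering, which keeps per-task statistics identical while making tasks mutually non-stationary.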
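The optimization recipe in the Experiment Setup row (cosine learning-rate schedule, no warmup, restarted at the start of each task) can be written as a plain schedule function. The helper name and the step count below are illustrative assumptions; only the schedule shape and the η0(0) = 30.0 value come from the quoted setup.

```python
import math

def cosine_lr(step, steps_per_task, eta0):
    """Cosine learning-rate schedule without warmup, restarted at every
    task boundary: the global step is taken modulo the task length.
    (Sketch of the described schedule, not the authors' code.)"""
    t = step % steps_per_task  # restart at the beginning of each task
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / steps_per_task))

# Example with the quoted Split-CIFAR10 learning rate (eta0 = 30.0).
steps_per_task = 100  # hypothetical; depends on dataset size and epochs
assert cosine_lr(0, steps_per_task, 30.0) == 30.0               # task start
assert cosine_lr(steps_per_task, steps_per_task, 30.0) == 30.0  # restart
assert abs(cosine_lr(50, steps_per_task, 30.0) - 15.0) < 1e-9   # midpoint
```

The modulo makes the restart explicit: the rate decays from η0 toward 0 within each task and jumps back to η0 when the next task begins, matching the "restarting the learning rate at the beginning of each task" description.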