Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Continual evaluation for lifelong learning: Identifying the stability gap
Authors: Matthias De Lange, Gido M van de Ven, Tinne Tuytelaars
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically we show that experience replay, constraint-based replay, knowledge-distillation, and parameter regularization methods are all prone to the stability gap; and that the stability gap can be observed in class-, task-, and domain-incremental learning benchmarks. Additionally, a controlled experiment shows that the stability gap increases when tasks are more dissimilar. Finally, by disentangling gradients into plasticity and stability components, we propose a conceptual explanation for the stability gap. |
| Researcher Affiliation | Academia | Matthias De Lange, Gido M. van de Ven & Tinne Tuytelaars KU Leuven |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Contributions in this work are along three main lines, with code publicly available.¹ ... ¹Code: https://github.com/mattdl/ContinualEvaluation |
| Open Datasets | Yes | For experiments on class-incremental learning we use three standard datasets: MNIST (LeCun & Cortes, 2010) consists of grayscale handwritten digits, CIFAR10 (Krizhevsky et al., 2009) contains images from a range of vehicles and animals, and MiniImagenet (Vinyals et al., 2016) is a subset of Imagenet (Russakovsky et al., 2015). ... For domain-incremental learning we consider drastic domain changes in Mini-DomainNet (Zhou et al., 2021), a scaled-down subset of 126 classes of DomainNet (Peng et al., 2019)... Synthetic Speech Commands dataset (Buchner, 2017) |
| Dataset Splits | Yes | To make sure our worst-case analysis applies to the best-case configuration for ER, we run a gridsearch over different hyperparameters and select the entry with the highest stability-plasticity trade-off metric ACC on the held-out evaluation data (Lopez-Paz & Ranzato, 2017). |
| Hardware Specification | No | All results were performed on a compute cluster with a range of NVIDIA GPUs. |
| Software Dependencies | No | The experiments were based on the Avalanche framework (Lomonaco et al., 2021) in PyTorch (Paszke et al., 2019). The versions of Avalanche and PyTorch are not specified. |
| Experiment Setup | Yes | Setup. We employ continual evaluation with evaluation periodicity in range $\rho_{\text{eval}} \in \{1, 10, 10^2, 10^3\}$ and subset size 1k per evaluation task... Split-MNIST uses an MLP with 2 hidden layers of 400 units. Split-CIFAR10, Split-MiniImagenet and Mini-DomainNet use a slim version of Resnet18 (Lopez-Paz & Ranzato, 2017). SGD optimization is used with 0.9 momentum. For all experiments, the learning rate $\eta$ for the gradient-based updates is considered as hyperparameter in the set $\eta \in \{0.1, 0.01, 0.001, 0.0001\}$. A fixed batch size is used for all benchmarks, with 128 for the larger-scale Split-MiniImagenet and Mini-DomainNet, and 256 for the smaller Split-MNIST and Split-CIFAR10. ... We indicate the selected hyperparameters $(\eta, \alpha, \lvert M \rvert)$ per dataset here: Split-MNIST $(0.01, 0.3, 2 \cdot 10^3)$, Split-CIFAR10 $(0.1, 0.7, 10^3)$, Split-MiniImagenet $(0.1, 0.5, 10^4)$, Mini-DomainNet $(0.1, 0.3, 10^3)$. |
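
For readers who want the reported setup in machine-readable form, the values quoted in the Experiment Setup row can be collected as in the sketch below. This is not code from the paper or its repository: the names `ReplayConfig`, `SELECTED` and `evaluation_iterations` are hypothetical, the memory sizes are read as powers of ten, and the rule "evaluate after every ρ_eval-th training iteration" is an assumed interpretation of the evaluation periodicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayConfig:
    """Selected hyperparameters as quoted in the table (hypothetical container)."""
    learning_rate: float  # eta
    alpha: float          # alpha, reported alongside eta; its role is not described here
    memory_size: int      # |M|, assumed to be a power-of-ten sample count
    batch_size: int


# Values transcribed from the Experiment Setup row.
SELECTED = {
    "Split-MNIST": ReplayConfig(0.01, 0.3, 2 * 10**3, 256),
    "Split-CIFAR10": ReplayConfig(0.1, 0.7, 10**3, 256),
    "Split-MiniImagenet": ReplayConfig(0.1, 0.5, 10**4, 128),
    "Mini-DomainNet": ReplayConfig(0.1, 0.3, 10**3, 128),
}

LEARNING_RATE_GRID = (0.1, 0.01, 0.001, 0.0001)  # candidate values for eta
EVAL_PERIODICITIES = (1, 10, 10**2, 10**3)       # candidate values for rho_eval
EVAL_SUBSET_SIZE = 1000                          # samples per evaluation task
SGD_MOMENTUM = 0.9


def evaluation_iterations(total_iterations: int, rho_eval: int) -> list[int]:
    """Return the 1-based training iterations after which evaluation runs.

    Assumption: evaluation happens after every rho_eval-th update.
    """
    if rho_eval < 1:
        raise ValueError("rho_eval must be a positive integer")
    return list(range(rho_eval, total_iterations + 1, rho_eval))
```

With `rho_eval = 1` the model is evaluated after every update, while larger periodicities evaluate less often; for example, `evaluation_iterations(25, 10)` yields iterations 10 and 20.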