Continual evaluation for lifelong learning: Identifying the stability gap

Authors: Matthias De Lange, Gido M. van de Ven, Tinne Tuytelaars

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically we show that experience replay, constraint-based replay, knowledge-distillation, and parameter regularization methods are all prone to the stability gap; and that the stability gap can be observed in class-, task-, and domain-incremental learning benchmarks. Additionally, a controlled experiment shows that the stability gap increases when tasks are more dissimilar. Finally, by disentangling gradients into plasticity and stability components, we propose a conceptual explanation for the stability gap.
Researcher Affiliation | Academia | Matthias De Lange, Gido M. van de Ven & Tinne Tuytelaars (KU Leuven)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Contributions in this work are along three main lines, with code publicly available. ... Code: https://github.com/mattdl/ContinualEvaluation
Open Datasets | Yes | For experiments on class-incremental learning we use three standard datasets: MNIST (LeCun & Cortes, 2010) consists of grayscale handwritten digits, CIFAR10 (Krizhevsky et al., 2009) contains images from a range of vehicles and animals, and MiniImagenet (Vinyals et al., 2016) is a subset of Imagenet (Russakovsky et al., 2015). ... For domain-incremental learning we consider drastic domain changes in Mini-DomainNet (Zhou et al., 2021), a scaled-down subset of 126 classes of DomainNet (Peng et al., 2019)... Synthetic Speech Commands dataset (Buchner, 2017)
Dataset Splits | Yes | To make sure our worst-case analysis applies to the best-case configuration for ER, we run a gridsearch over different hyperparameters and select the entry with the highest stability-plasticity trade-off metric ACC on the held-out evaluation data (Lopez-Paz & Ranzato, 2017).
Hardware Specification | No | All results were performed on a compute cluster with a range of NVIDIA GPUs.
Software Dependencies | No | The experiments were based on the Avalanche framework (Lomonaco et al., 2021) in PyTorch (Paszke et al., 2019). The versions of Avalanche and PyTorch are not specified.
Experiment Setup | Yes | Setup. We employ continual evaluation with evaluation periodicity in range ρ_eval ∈ {1, 10, 10^2, 10^3} and subset size 1k per evaluation task... Split-MNIST uses an MLP with 2 hidden layers of 400 units. Split-CIFAR10, Split-MiniImagenet and Mini-DomainNet use a slim version of Resnet18 (Lopez-Paz & Ranzato, 2017). SGD optimization is used with 0.9 momentum. For all experiments, the learning rate η for the gradient-based updates is considered as hyperparameter in the set η ∈ {0.1, 0.01, 0.001, 0.0001}. A fixed batch size is used for all benchmarks, with 128 for the larger-scale Split-MiniImagenet and Mini-DomainNet, and 256 for the smaller Split-MNIST and Split-CIFAR10. ... We indicate the selected hyperparameters (η, α, |M|) per dataset here: Split-MNIST (0.01, 0.3, 2·10^3), Split-CIFAR10 (0.1, 0.7, 10^3), Split-MiniImagenet (0.1, 0.5, 10^4), Mini-DomainNet (0.1, 0.3, 10^3).
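
The benchmarks named in the Open Datasets row are available as ready-made class-incremental streams in Avalanche, the framework cited in the Software Dependencies row. The snippet below is a minimal sketch assuming a recent avalanche-lib release; the class and constructor argument names are taken from that library, not from the authors' released code, and may differ across versions.

```python
# Hedged sketch: building the Split-MNIST class-incremental stream with
# Avalanche's classic benchmarks (API names assumed from recent
# avalanche-lib releases, not from the paper's code).
from avalanche.benchmarks.classic import SplitMNIST

benchmark = SplitMNIST(n_experiences=5, return_task_id=False, seed=0)

for experience in benchmark.train_stream:
    print(f"Experience {experience.current_experience}: "
          f"classes {experience.classes_in_this_experience}")
```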
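
The Dataset Splits row quotes the paper's model-selection protocol: run a grid search over hyperparameters and keep the configuration with the highest stability-plasticity trade-off metric ACC on held-out evaluation data. The sketch below illustrates that selection loop; `train_and_evaluate` is a hypothetical placeholder for one full continual-learning run, and the α and memory-size grids are illustrative rather than the paper's exact search ranges.

```python
# Hedged sketch of the described grid search: keep the configuration with
# the highest held-out ACC. `train_and_evaluate` is a placeholder, not a
# function from the released code.
import random
from itertools import product

def train_and_evaluate(cfg):
    """Placeholder: a real run would train with `cfg` and return the final
    average accuracy (ACC) on the held-out evaluation data."""
    return random.random()

grid = {
    "lr": [0.1, 0.01, 0.001, 0.0001],    # learning rate eta (from the paper)
    "alpha": [0.3, 0.5, 0.7],            # illustrative replay-weighting values
    "mem_size": [1_000, 2_000, 10_000],  # illustrative replay-memory sizes |M|
}

best_acc, best_cfg = float("-inf"), None
for values in product(*grid.values()):
    cfg = dict(zip(grid, values))
    acc = train_and_evaluate(cfg)
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg

print(f"Selected {best_cfg} with ACC {best_acc:.3f}")
```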
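
Because the Software Dependencies row notes that the Avalanche and PyTorch versions are not reported, anyone reproducing the setup has to pin them independently. The check below is a small sketch that records whatever versions are installed locally; the PyPI distribution names "avalanche-lib" and "torch" are assumptions, not taken from the paper.

```python
# Hedged sketch: log the installed versions of the two dependencies the
# paper names but does not pin (PyPI names "avalanche-lib" and "torch"
# are assumed).
from importlib.metadata import PackageNotFoundError, version

for dist in ("avalanche-lib", "torch"):
    try:
        print(f"{dist}=={version(dist)}")
    except PackageNotFoundError:
        print(f"{dist} is not installed")
```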
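
The Experiment Setup row fixes the Split-MNIST architecture (an MLP with two hidden layers of 400 units), the optimizer (SGD with 0.9 momentum and the selected learning rate 0.01), and the continual-evaluation periodicity ρ_eval. The snippet below is a minimal PyTorch sketch of that configuration; the evaluation body is a placeholder rather than the authors' implementation.

```python
# Hedged sketch of the quoted Split-MNIST setup: an MLP with two hidden
# layers of 400 units, SGD with 0.9 momentum, and an evaluation hook fired
# every `eval_every` updates (the paper's periodicity rho_eval). The
# evaluation step is a placeholder, not the authors' implementation.
import torch.nn as nn
import torch.optim as optim

class MLP(nn.Module):
    def __init__(self, in_dim=28 * 28, hidden=400, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # selected eta for Split-MNIST
criterion = nn.CrossEntropyLoss()
eval_every = 10  # rho_eval: evaluate after every 10 gradient updates

def training_step(step, x, y):
    """One gradient update followed, periodically, by continual evaluation."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if step % eval_every == 0:
        # Placeholder: evaluate per-task accuracy on the 1k held-out
        # evaluation subsets here to track worst-case accuracy drops.
        pass
    return loss.item()
```

Per the quoted setup, the data loaders feeding `training_step` would use a batch size of 256 for Split-MNIST and Split-CIFAR10, and 128 for Split-MiniImagenet and Mini-DomainNet.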