Continual evaluation for lifelong learning: Identifying the stability gap
Authors: Matthias De Lange, Gido M. van de Ven, Tinne Tuytelaars
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically we show that experience replay, constraint-based replay, knowledge-distillation, and parameter regularization methods are all prone to the stability gap; and that the stability gap can be observed in class-, task-, and domain-incremental learning benchmarks. Additionally, a controlled experiment shows that the stability gap increases when tasks are more dissimilar. Finally, by disentangling gradients into plasticity and stability components, we propose a conceptual explanation for the stability gap. |
| Researcher Affiliation | Academia | Matthias De Lange, Gido M. van de Ven & Tinne Tuytelaars, KU Leuven |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Contributions in this work are along three main lines, with code publicly available. ... Code: https://github.com/mattdl/ContinualEvaluation |
| Open Datasets | Yes | For experiments on class-incremental learning we use three standard datasets: MNIST (LeCun & Cortes, 2010) consists of grayscale handwritten digits, CIFAR10 (Krizhevsky et al., 2009) contains images from a range of vehicles and animals, and Mini-ImageNet (Vinyals et al., 2016) is a subset of ImageNet (Russakovsky et al., 2015). ... For domain-incremental learning we consider drastic domain changes in Mini-DomainNet (Zhou et al., 2021), a scaled-down subset of 126 classes of DomainNet (Peng et al., 2019)... Synthetic Speech Commands dataset (Buchner, 2017) |
| Dataset Splits | Yes | To make sure our worst-case analysis applies to the best-case configuration for ER, we run a grid search over different hyperparameters and select the entry with the highest stability-plasticity trade-off metric ACC on the held-out evaluation data (Lopez-Paz & Ranzato, 2017). |
| Hardware Specification | No | All results were performed on a compute cluster with a range of NVIDIA GPUs. |
| Software Dependencies | No | The experiments were based on the Avalanche framework (Lomonaco et al., 2021) in PyTorch (Paszke et al., 2019). The versions of Avalanche and PyTorch are not specified. |
| Experiment Setup | Yes | Setup. We employ continual evaluation with evaluation periodicity in range ρ_eval ∈ {1, 10, 10², 10³} and subset size 1k per evaluation task... Split-MNIST uses an MLP with 2 hidden layers of 400 units. Split-CIFAR10, Split-MiniImageNet and Mini-DomainNet use a slim version of ResNet-18 (Lopez-Paz & Ranzato, 2017). SGD optimization is used with 0.9 momentum. For all experiments, the learning rate η for the gradient-based updates is considered as hyperparameter in the set η ∈ {0.1, 0.01, 0.001, 0.0001}. A fixed batch size is used for all benchmarks, with 128 for the larger-scale Split-MiniImageNet and Mini-DomainNet, and 256 for the smaller Split-MNIST and Split-CIFAR10. ...We indicate the selected hyperparameters (η, α, \|M\|) per dataset here: Split-MNIST (0.01, 0.3, 2·10³), Split-CIFAR10 (0.1, 0.7, 10³), Split-MiniImageNet (0.1, 0.5, 10⁴), Mini-DomainNet (0.1, 0.3, 10³). |
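
To make the "Experiment Setup" row more concrete, the following is a minimal PyTorch sketch of the quoted Split-MNIST configuration: an MLP with 2 hidden layers of 400 units, SGD with 0.9 momentum, a learning-rate grid of {0.1, 0.01, 0.001, 0.0001}, and benchmark-dependent batch sizes. The helper name `make_split_mnist_mlp` and the dictionary layout are illustrative assumptions, not taken from the authors' codebase.

```python
# Sketch of the model/optimizer setup quoted above (assumed helper names,
# standard PyTorch APIs). Only layer sizes, momentum, the learning-rate grid,
# and batch sizes come from the paper's reported setup.
import torch
import torch.nn as nn


def make_split_mnist_mlp(num_classes: int = 10) -> nn.Module:
    """MLP with 2 hidden layers of 400 units, as used for Split-MNIST."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 400), nn.ReLU(),
        nn.Linear(400, 400), nn.ReLU(),
        nn.Linear(400, num_classes),
    )


# Learning rate is tuned as a hyperparameter; batch size depends on the benchmark.
LEARNING_RATE_GRID = [0.1, 0.01, 0.001, 0.0001]
BATCH_SIZE = {
    "split_mnist": 256,
    "split_cifar10": 256,
    "split_miniimagenet": 128,
    "mini_domainnet": 128,
}

model = make_split_mnist_mlp()
optimizer = torch.optim.SGD(model.parameters(),
                            lr=LEARNING_RATE_GRID[1],  # 0.01, the value selected for Split-MNIST
                            momentum=0.9)
```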
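
The continual-evaluation protocol itself, measuring accuracy on a fixed 1k-sample subset of each seen task every ρ_eval training iterations, can be sketched as a plain training loop. Only the periodicity and subset size come from the quoted setup; the function names and loop structure below are assumptions and do not reflect the authors' Avalanche-based implementation.

```python
# Hypothetical sketch of per-iteration ("continual") evaluation: after every
# rho_eval training steps, accuracy is measured on a held-out subset of each
# task seen so far. Names like train_with_continual_eval are illustrative.
import torch


@torch.no_grad()
def evaluate_subset(model, loader, device="cpu"):
    """Accuracy on one held-out evaluation subset (e.g. 1k samples per task)."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred = model(x).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / max(total, 1)


def train_with_continual_eval(model, optimizer, criterion, train_loader,
                              eval_loaders, rho_eval=10, device="cpu"):
    """Train on the current task, evaluating all seen tasks every rho_eval steps."""
    history = []  # one list of per-task accuracies per evaluation point
    for step, (x, y) in enumerate(train_loader, start=1):
        model.train()
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        if step % rho_eval == 0:
            history.append([evaluate_subset(model, l, device) for l in eval_loaders])
    return history
```

Tracking accuracy at this fine granularity, rather than only at task boundaries, is what exposes the transient drop on previous tasks that the paper names the stability gap.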