Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics

Authors: Vinay Venkatesh Ramasesh, Ethan Dyer, Maithra Raghu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With experiments on split CIFAR-10, a novel distribution-shift CIFAR-100 variant, CelebA, and ImageNet, we analyze neural network layer representations, finding that higher layers are disproportionately responsible for catastrophic forgetting, with the sequential training process erasing earlier task subspaces. We investigate different methods for mitigating forgetting, finding that while all stabilize higher-layer representations, some methods encourage greater feature reuse in higher layers, while others store task representations as orthogonal subspaces, preventing interference. We study the connection between forgetting and task semantics, finding that semantic similarity between subsequent tasks consistently controls the degree of forgetting. Informed by the representation results, we construct an analytic model that relates task similarity to representation interference and forgetting. This provides a quantitative empirical measure of task similarity, and together these show that forgetting is most severe for tasks with intermediate similarity. (A representation-similarity probe illustrating the layer-wise analysis is sketched after the table.)
Researcher Affiliation | Industry | Vinay V. Ramasesh (Blueshift, Alphabet, Mountain View, CA); Ethan Dyer (Blueshift, Alphabet, Mountain View, CA); Maithra Raghu (Google Brain, Mountain View, CA)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | Tasks: We conduct this study over many different tasks and datasets: (i) Split CIFAR-10, where the ten-class dataset is split into two tasks of 5 classes each; (ii) input-distribution-shift CIFAR-100, where each task is to distinguish between the CIFAR-100 superclasses, but the input data for each task is a different subset of the constituent classes of each superclass (see Appendix A.3); (iii) CelebA attribute prediction, where the two tasks have input data of either men or women, and we predict either smile or mouth open; (iv) ImageNet superclass prediction, similar to CIFAR-100. (A task-construction sketch for Split CIFAR-10 follows the table.)
Dataset Splits | No | The paper describes the datasets used and how tasks were defined (e.g., split CIFAR-10 into two tasks of 5 classes), but it does not specify the train/validation/test splits (e.g., percentages or exact counts) for the datasets.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or cloud computing instances) used for running the experiments.
Software Dependencies | No | The paper mentions using 'cross-entropy loss with SGD with momentum (β = 0.9)' but does not specify any programming languages, libraries, or frameworks with their version numbers that were used for the implementation.
Experiment Setup | Yes | All networks are trained using cross-entropy loss with SGD with momentum (β = 0.9) and a batch size of 128. We do not use learning-rate schedules here, leaving that investigation to future work. To better correspond with practical situations, however, we choose a learning rate and total number of epochs such that interpolation (of the training set) occurs. For the split CIFAR-10 task, this typically corresponds to training for 30 epochs per task with a learning rate of 0.01 (VGG), 0.03 (ResNet), and 0.03 (DenseNet). We use weight decay with strength 1e-4 and do not apply any data augmentation. (A training-configuration sketch follows the table.)
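
The Split CIFAR-10 protocol quoted under Open Datasets can be illustrated with a short data-construction sketch. This is a minimal example assuming a PyTorch/torchvision pipeline and a 0-4 / 5-9 class split; the paper does not specify its framework or the exact class assignment, so those choices (and the helper name make_split_task) are assumptions for illustration.

```python
# Minimal sketch of Split CIFAR-10: the ten classes are divided into two
# 5-way tasks that are then trained on sequentially.
from torch.utils.data import Subset
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # no data augmentation, per the setup above
cifar10 = datasets.CIFAR10(root="./data", train=True, download=True,
                           transform=transform)

# Assumed split: classes 0-4 form Task 1, classes 5-9 form Task 2.
task_class_sets = [set(range(0, 5)), set(range(5, 10))]

def make_split_task(dataset, class_set):
    """Restrict `dataset` to examples whose label falls in `class_set`."""
    indices = [i for i, label in enumerate(dataset.targets) if label in class_set]
    return Subset(dataset, indices)

task_datasets = [make_split_task(cifar10, s) for s in task_class_sets]
```

For a 5-way classification head, the second task's labels would additionally need to be remapped to 0-4 (e.g., via a target transform); that detail is not specified in the paper.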
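
The hyperparameters quoted under Experiment Setup translate directly into a per-task training configuration. The sketch below assumes PyTorch; `model` and `task_loader` are placeholders, and only the stated values (SGD with momentum 0.9, batch size 128 via the loader, weight decay 1e-4, 30 epochs per task, a constant architecture-specific learning rate) are taken from the text.

```python
import torch
import torch.nn as nn

# Learning rates quoted for Split CIFAR-10; the dictionary keys are illustrative.
LEARNING_RATES = {"vgg": 0.01, "resnet": 0.03, "densenet": 0.03}

def train_on_task(model, task_loader, arch="resnet", epochs=30, device="cpu"):
    """Train `model` on a single task with the quoted hyperparameters."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=LEARNING_RATES[arch],
                                momentum=0.9,
                                weight_decay=1e-4)
    model.to(device)
    model.train()
    for _ in range(epochs):                   # constant learning rate, no schedule
        for images, labels in task_loader:    # loader built with batch_size=128
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```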
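
The layer-wise finding summarized under Research Type (higher layers change most under sequential training) is typically probed by comparing each layer's Task-1 representations before and after training on Task 2. The sketch below uses linear centered kernel alignment (CKA) as the similarity measure; the choice of metric and the activation arrays `acts_before` / `acts_after` are assumptions for illustration, not details quoted from the text.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (examples, features)."""
    X = X - X.mean(axis=0, keepdims=True)  # center features over examples
    Y = Y - Y.mean(axis=0, keepdims=True)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro") *
                   np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

# acts_before[l], acts_after[l]: layer-l activations on Task-1 inputs, recorded
# before and after training on Task 2. A sharp drop in similarity concentrated
# in the higher layers is the signature of forgetting being localized there.
# per_layer_similarity = [linear_cka(acts_before[l], acts_after[l])
#                         for l in range(num_layers)]
```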