Effect of scale on catastrophic forgetting in neural networks
Authors: Vinay Venkatesh Ramasesh, Aitor Lewkowycz, Ethan Dyer
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present an empirical study of catastrophic forgetting in this pretraining paradigm. Our experiments indicate that large, pretrained ResNets and Transformers are significantly more resistant to forgetting than randomly-initialized, trained-from-scratch models; this robustness systematically improves with scale of both model and pretraining dataset size. |
| Researcher Affiliation | Industry | Vinay Ramasesh, Aitor Lewkowycz, and Ethan Dyer; Google Research, Blueshift; {ramasesh,alewkowycz,edyer}@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing its source code nor provides a link to a code repository. |
| Open Datasets | Yes | Tasks: Most of the experiments in this paper are conducted in the standard task-split setting (Kirkpatrick et al., 2017; Zenke et al., 2017). We consider split CIFAR-10 task sequences: the ten-class dataset is split and sequentially trained on two tasks of 5 classes each. We also consider sequential 10- and 50-class splits of CIFAR-100. Beyond split CIFAR, we study the following datasets: Oxford-IIIT Pet, Oxford Flowers 102, Street View House Numbers (SVHN), Caltech-UCSD Birds 200 (CUB-200), Cars196, DomainNet/Clipart. [...] For our language model experiments, we consider next-token prediction generative tasks and use the IMDb Reviews (Maas et al., 2011) and English Wikipedia (Foundation) datasets. [...] Unless otherwise specified, we pretrain our image models in a supervised manner on the ImageNet21k dataset, which contains approximately 14 million images in about 26,000 distinct categories, using the Adam optimizer (Kingma & Ba, 2017). |
| Dataset Splits | No | The paper describes how datasets are split into tasks for sequential training (e.g., 'two tasks of 5 classes each' for CIFAR-10) and evaluates on test sets, but it does not explicitly state specific training, validation, and test dataset splits with percentages or sample counts for the overall dataset. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU, CPU models, or TPU versions) used for running the experiments. |
| Software Dependencies | Yes | We use the version of ImageNet21k available in TensorFlow Datasets 4.0.1 |
| Experiment Setup | Yes | Pretraining is done using the Adam optimizer (β1 = 0.9 and β2 = 0.999); for all models, we use a batch size of 4096. Our learning-rate schedule includes a warmup of 10k steps to a maximum learning rate η = 10⁻³, followed by a linear decay to 10⁻⁵. For the Vision Transformer models, we use a weight decay penalty of 0.03 and a dropout rate of 0.1; for ResNet models, we do not use weight decay or dropout. |
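
The split-CIFAR-10 task construction quoted in the Open Datasets row can be sketched as below. Since no source code is released, this is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes TensorFlow Datasets is available, and the particular 5-class groupings and the within-task label remapping are illustrative choices.

```python
# Minimal sketch (not the authors' code): a split CIFAR-10 task sequence,
# where the ten-class dataset is divided into two tasks of 5 classes each.
import tensorflow as tf
import tensorflow_datasets as tfds

def make_task(split, classes, batch_size=128):
    """Return a tf.data pipeline restricted to the given class labels."""
    class_table = tf.constant(classes, dtype=tf.int64)
    ds = tfds.load("cifar10", split=split, as_supervised=True)
    # Keep only examples whose label belongs to this task's class subset.
    ds = ds.filter(lambda image, label: tf.reduce_any(tf.equal(label, class_table)))
    # Remap labels to 0..len(classes)-1 within the task (one common convention).
    ds = ds.map(lambda image, label: (
        tf.cast(image, tf.float32) / 255.0,
        tf.argmax(tf.cast(tf.equal(label, class_table), tf.int32))))
    return ds.shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Two sequential 5-class tasks: train on task A, then task B, and measure
# how much accuracy on task A degrades (forgetting).
task_a_train = make_task("train", classes=[0, 1, 2, 3, 4])
task_b_train = make_task("train", classes=[5, 6, 7, 8, 9])
task_a_test  = make_task("test",  classes=[0, 1, 2, 3, 4])
```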
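
The pretraining recipe quoted in the Experiment Setup row (Adam with β1 = 0.9 and β2 = 0.999, a 10k-step warmup to 10⁻³, then a linear decay to 10⁻⁵) could be expressed with optax as in the sketch below. The total step count is a placeholder, and the optimizer construction is an assumption for illustration; the paper does not publish its training code.

```python
# Hedged sketch of the reported optimizer and learning-rate schedule.
import optax

TOTAL_STEPS = 300_000   # hypothetical; set to match the actual pretraining budget
WARMUP_STEPS = 10_000   # "warmup of 10k steps" from the paper

# Linear warmup to 1e-3, then linear decay to 1e-5 over the remaining steps.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=1e-3,
                              transition_steps=WARMUP_STEPS),
        optax.linear_schedule(init_value=1e-3, end_value=1e-5,
                              transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with beta_1 = 0.9 and beta_2 = 0.999, as stated in the paper. The
# ViT-only weight decay of 0.03 would be added separately, e.g. by chaining
# optax.add_decayed_weights; it is omitted here to keep the sketch minimal.
optimizer = optax.adam(learning_rate=schedule, b1=0.9, b2=0.999)
```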