Effect of scale on catastrophic forgetting in neural networks
Authors: Vinay Venkatesh Ramasesh, Aitor Lewkowycz, Ethan Dyer
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present an empirical study of catastrophic forgetting in this pretraining paradigm. Our experiments indicate that large, pretrained ResNets and Transformers are significantly more resistant to forgetting than randomly-initialized, trained-from-scratch models; this robustness systematically improves with scale of both model and pretraining dataset size. |
| Researcher Affiliation | Industry | Vinay Ramasesh, Aitor Lewkowycz, and Ethan Dyer; Google Research, Blueshift; {ramasesh,alewkowycz,edyer}@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing its source code nor provides a link to a code repository. |
| Open Datasets | Yes | Tasks: Most of the experiments in this paper are conducted in the standard task-split setting (Kirkpatrick et al., 2017; Zenke et al., 2017). We consider split CIFAR-10 task sequences: the ten-class dataset is split and sequentially trained on two tasks of 5 classes each. We also consider sequential 10- and 50-class splits of CIFAR-100. Beyond split CIFAR, we study the following datasets: Oxford-IIIT Pet, Oxford Flowers 102, Street View House Numbers (SVHN), Caltech-UCSD Birds 200 (CUB-200), Cars196, DomainNet/Clipart. [...] For our language model experiments, we consider next-token prediction generative tasks and use the IMDb Reviews (Maas et al., 2011) and English Wikipedia (Foundation) datasets. [...] Unless otherwise specified, we pretrain our image models in a supervised manner on the ImageNet21k dataset, which contains approximately 14 million images in about 26,000 distinct categories, using the Adam optimizer (Kingma & Ba, 2017). |
| Dataset Splits | No | The paper describes how datasets are split into tasks for sequential training (e.g., 'two tasks of 5 classes each' for CIFAR-10) and evaluates on test sets, but it does not explicitly state specific training, validation, and test dataset splits with percentages or sample counts for the overall dataset. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU, CPU models, or TPU versions) used for running the experiments. |
| Software Dependencies | Yes | We use the version of ImageNet21k available in TensorFlow Datasets 4.0.1 |
| Experiment Setup | Yes | Pretraining is done using the Adam optimizer (β1 = 0.9 and β2 = 0.999); for all models, we use a batch size of 4096. Our learning-rate schedule includes a warmup of 10k steps to a maximum learning rate η = 10⁻³, followed by a linear decay to 10⁻⁵. For the Vision Transformer models, we use a weight decay penalty of 0.03 and a dropout rate of 0.1; for ResNet models, we do not use weight decay or dropout. |
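
The split-CIFAR-10 task construction quoted in the Open Datasets row can be sketched as below. Since no source code is released, this is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes TensorFlow Datasets is available, and the particular 5-class groupings and the within-task label remapping are illustrative choices.

```python
# Minimal sketch (not the authors' code): a split CIFAR-10 task sequence,
# where the ten-class dataset is divided into two tasks of 5 classes each.
import tensorflow as tf
import tensorflow_datasets as tfds

def make_task(split, classes, batch_size=128):
    """Return a tf.data pipeline restricted to the given class labels."""
    class_table = tf.constant(classes, dtype=tf.int64)
    ds = tfds.load("cifar10", split=split, as_supervised=True)
    # Keep only examples whose label belongs to this task's class subset.
    ds = ds.filter(lambda image, label: tf.reduce_any(tf.equal(label, class_table)))
    # Remap labels to 0..len(classes)-1 within the task (one common convention).
    ds = ds.map(lambda image, label: (
        tf.cast(image, tf.float32) / 255.0,
        tf.argmax(tf.cast(tf.equal(label, class_table), tf.int32))))
    return ds.shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Two sequential 5-class tasks: train on task A, then task B, and measure
# how much accuracy on task A degrades (forgetting).
task_a_train = make_task("train", classes=[0, 1, 2, 3, 4])
task_b_train = make_task("train", classes=[5, 6, 7, 8, 9])
task_a_test  = make_task("test",  classes=[0, 1, 2, 3, 4])
```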
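
The pretraining recipe quoted in the Experiment Setup row (Adam with β1 = 0.9 and β2 = 0.999, a 10k-step warmup to 10⁻³, then a linear decay to 10⁻⁵) could be expressed with optax as in the sketch below. The total step count is a placeholder, and the optimizer construction is an assumption for illustration; the paper does not publish its training code.

```python
# Hedged sketch of the reported optimizer and learning-rate schedule.
import optax

TOTAL_STEPS = 300_000   # hypothetical; set to match the actual pretraining budget
WARMUP_STEPS = 10_000   # "warmup of 10k steps" from the paper

# Linear warmup to 1e-3, then linear decay to 1e-5 over the remaining steps.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=1e-3,
                              transition_steps=WARMUP_STEPS),
        optax.linear_schedule(init_value=1e-3, end_value=1e-5,
                              transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with beta_1 = 0.9 and beta_2 = 0.999, as stated in the paper. The
# ViT-only weight decay of 0.03 would be added separately, e.g. by chaining
# optax.add_decayed_weights; it is omitted here to keep the sketch minimal.
optimizer = optax.adam(learning_rate=schedule, b1=0.9, b2=0.999)
```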