Wide Neural Networks Forget Less Catastrophically

Authors: Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Huiyi Hu, Razvan Pascanu, Dilan Gorur, Mehrdad Farajtabar

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that increasing the width alone reduces catastrophic forgetting significantly, while this is not the case for depth. We provide potential explanations that are consistent with the empirical results across different architectures and continual learning benchmarks.
Researcher Affiliation | Collaboration | Author affiliations: Washington State University (Seyed Iman Mirzadeh) and DeepMind. Correspondence to: Seyed Iman Mirzadeh <seyediman.mirzadeh@wsu.edu>, Mehrdad Farajtabar <farajtabar@google.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the described methodology or a link to a code repository.
Open Datasets | Yes | Benchmarks. We report our findings on two standard continual learning benchmarks: Rotated MNIST and Split CIFAR-100. In Rotated MNIST, each task is generated by rotating the MNIST images by 0, 22.5, 45, 67.5, and 90 degrees, respectively, constituting 5 different tasks. For Split CIFAR-100, each task contains the data from 5 random classes (without replacement), resulting in 20 tasks. (A construction sketch for both benchmarks follows the table.)
Dataset Splits | Yes | The experimental setup, such as benchmarks, network architectures, continual learning setting (e.g., number of tasks, episodic memory size, and training epochs per task), hyperparameters, and evaluation metrics are chosen to be similar to several other studies (Chaudhry et al., 2019; Farajtabar et al., 2020; Mirzadeh et al., 2021). To ensure that the reported results are not more favorable to a specific architecture, we use a grid of hyper-parameters, and for each model, we report the results with the best hyperparameters. For all experiments, we report the average and standard deviation over five runs with different random seeds for network initialization. (1) Average Accuracy: the average validation accuracy after the model has been continually trained for T tasks is defined as $A_T = \frac{1}{T}\sum_{i=1}^{T} a_{T,i}$ (3.1), where $a_{t,i}$ is the validation accuracy on the dataset of task i after the model finished learning task t. (A short sketch of this metric follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions some hyperparameters and the training optimizer (SGD) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) required to replicate the experiments.
Experiment Setup | Yes | The experimental setup, such as benchmarks, network architectures, continual learning setting (e.g., number of tasks, episodic memory size, and training epochs per task), hyperparameters, and evaluation metrics are chosen to be similar to several other studies (Chaudhry et al., 2019; Farajtabar et al., 2020; Mirzadeh et al., 2021). To ensure that the reported results are not more favorable to a specific architecture, we use a grid of hyper-parameters, and for each model, we report the results with the best hyperparameters. For all experiments, we report the average and standard deviation over five runs with different random seeds for network initialization. In addition to the random seed for initialization, for Split CIFAR-100, we use three different seeds for each run, where each seed corresponds to the random order in which classes are selected for each task. We provide more details about our experimental design in Appendix A. Hyper-parameter grid: learning rate [0.001, 0.01 (MNIST), 0.05 (CIFAR), 0.1]; batch size [16, 32 (CIFAR), 64 (MLP)]; SGD momentum [0.0 (MNIST), 0.8 (CIFAR)]; weight decay [0.0 (MNIST), 0.0001 (CIFAR)]. (A grid-search sketch follows the table.)
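
As a reading aid, here is a minimal sketch of how the two benchmarks quoted in the "Open Datasets" row could be constructed, assuming PyTorch/torchvision. The rotation angles and the 20-way, 5-class split follow the quoted text; the function names, the fixed class-ordering seed, and all other details are illustrative assumptions, not the authors' released code.

```python
# Minimal benchmark-construction sketch (assumed torchvision-based; not the authors' code).
import numpy as np
import torch
from torchvision import datasets, transforms

def rotated_mnist_tasks(root="./data", angles=(0.0, 22.5, 45.0, 67.5, 90.0)):
    """One task per rotation angle: five Rotated MNIST tasks."""
    tasks = []
    for angle in angles:
        tf = transforms.Compose([
            transforms.RandomRotation(degrees=(angle, angle)),  # fixed rotation angle
            transforms.ToTensor(),
        ])
        tasks.append(datasets.MNIST(root, train=True, download=True, transform=tf))
    return tasks

def split_cifar100_tasks(root="./data", num_tasks=20, classes_per_task=5, seed=0):
    """Partition CIFAR-100 into 20 disjoint 5-class tasks (classes drawn without replacement)."""
    full = datasets.CIFAR100(root, train=True, download=True,
                             transform=transforms.ToTensor())
    class_order = np.random.RandomState(seed).permutation(100)  # seed is an assumption
    targets = np.asarray(full.targets)
    tasks = []
    for t in range(num_tasks):
        task_classes = class_order[t * classes_per_task:(t + 1) * classes_per_task]
        idx = np.flatnonzero(np.isin(targets, task_classes))
        tasks.append(torch.utils.data.Subset(full, idx))
    return tasks
```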
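The average-accuracy metric in Eq. (3.1) of the "Dataset Splits" row can be computed directly from a matrix of per-task validation accuracies. The sketch below assumes such a matrix is available; the function name is hypothetical.

```python
# Sketch of the average-accuracy metric in Eq. (3.1): A_T is the mean
# validation accuracy over all T tasks after training on task T.
import numpy as np

def average_accuracy(acc_matrix):
    """acc_matrix[t, i] = validation accuracy on task i after training on task t
    (zero-indexed). Returns A_T, computed from the last row."""
    acc_matrix = np.asarray(acc_matrix)
    T = acc_matrix.shape[0]
    return acc_matrix[T - 1, :T].mean()

# Example with T = 3 tasks: only the accuracies after the final task are averaged.
acc = np.array([[0.95, 0.00, 0.00],
                [0.90, 0.93, 0.00],
                [0.88, 0.91, 0.94]])
print(average_accuracy(acc))  # (0.88 + 0.91 + 0.94) / 3 = 0.91
```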
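Finally, a sketch of the hyper-parameter grid quoted in the "Experiment Setup" row, assuming a simple exhaustive search that keeps the configuration with the best average accuracy. The grid values are copied from the table; the train_and_evaluate callable is a hypothetical placeholder, not an interface from the paper.

```python
# Hyper-parameter grid from the table; exhaustive search is an assumed procedure.
from itertools import product

GRID = {
    "learning_rate": [0.001, 0.01, 0.05, 0.1],  # 0.01 reported for MNIST, 0.05 for CIFAR
    "batch_size": [16, 32, 64],                 # 32 reported for CIFAR, 64 for the MLP
    "momentum": [0.0, 0.8],                     # 0.0 for MNIST, 0.8 for CIFAR
    "weight_decay": [0.0, 1e-4],                # 0.0 for MNIST, 1e-4 for CIFAR
}

def grid_search(train_and_evaluate):
    """Run every configuration and return the one with the best average accuracy."""
    best_cfg, best_acc = None, -1.0
    for values in product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        acc = train_and_evaluate(**cfg)  # hypothetical training/evaluation routine
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```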