Wide Neural Networks Forget Less Catastrophically

Authors: Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Huiyi Hu, Razvan Pascanu, Dilan Gorur, Mehrdad Farajtabar

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that increasing the width alone reduces catastrophic forgetting significantly, while this is not the case for depth. We provide potential explanations that are consistent with the empirical results across different architectures and continual learning benchmarks.
Researcher Affiliation | Collaboration | Author affiliations: Washington State University (Seyed Iman Mirzadeh) and DeepMind. Correspondence to: Seyed Iman Mirzadeh <seyediman.mirzadeh@wsu.edu>, Mehrdad Farajtabar <farajtabar@google.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the described methodology or a link to a code repository.
Open Datasets | Yes | Benchmarks. We report our findings on two standard continual learning benchmarks: Rotated MNIST and Split CIFAR-100. In Rotated MNIST, each task is generated by rotating the MNIST images by 0, 22.5, 45, 67.5, and 90 degrees, respectively, constituting 5 different tasks. For Split CIFAR-100, each task contains the data from 5 random classes (without replacement), resulting in 20 tasks. (A construction sketch for both benchmarks follows the table.)
Dataset Splits | Yes | The experimental setup, such as benchmarks, network architectures, continual learning setting (e.g., number of tasks, episodic memory size, and training epochs per task), hyperparameters, and evaluation metrics are chosen to be similar to several other studies (Chaudhry et al., 2019; Farajtabar et al., 2020; Mirzadeh et al., 2021). To ensure that the reported results are not more favorable to a specific architecture, we use a grid of hyper-parameters, and for each model, we report the results with the best hyperparameters. For all experiments, we report the average and standard deviation over five runs with different random seeds for network initialization. (1) Average Accuracy: the average validation accuracy after the model has been continually trained for T tasks is defined as $A_T = \frac{1}{T}\sum_{i=1}^{T} a_{T,i}$ (3.1), where $a_{t,i}$ is the validation accuracy on the dataset of task i after the model finished learning task t. (A short sketch of this metric follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions some hyperparameters and the training optimizer (SGD) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) required to replicate the experiments.
Experiment Setup | Yes | The experimental setup, such as benchmarks, network architectures, continual learning setting (e.g., number of tasks, episodic memory size, and training epochs per task), hyperparameters, and evaluation metrics are chosen to be similar to several other studies (Chaudhry et al., 2019; Farajtabar et al., 2020; Mirzadeh et al., 2021). To ensure that the reported results are not more favorable to a specific architecture, we use a grid of hyper-parameters, and for each model, we report the results with the best hyperparameters. For all experiments, we report the average and standard deviation over five runs with different random seeds for network initialization. In addition to the random seed for initialization, for Split CIFAR-100, we use three different seeds for each run, where each seed corresponds to the random order in which classes are selected for each task. We provide more details about our experimental design in Appendix A. Hyper-parameter grid: learning rate [0.001, 0.01 (MNIST), 0.05 (CIFAR), 0.1]; batch size [16, 32 (CIFAR), 64 (MLP)]; SGD momentum [0.0 (MNIST), 0.8 (CIFAR)]; weight decay [0.0 (MNIST), 0.0001 (CIFAR)]. (A grid-search sketch follows the table.)
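
As a reading aid, here is a minimal sketch of how the two benchmarks quoted in the "Open Datasets" row could be constructed, assuming PyTorch/torchvision. The rotation angles and the 20-way, 5-class split follow the quoted text; the function names, the fixed class-ordering seed, and all other details are illustrative assumptions, not the authors' released code.

```python
# Minimal benchmark-construction sketch (assumed torchvision-based; not the authors' code).
import numpy as np
import torch
from torchvision import datasets, transforms

def rotated_mnist_tasks(root="./data", angles=(0.0, 22.5, 45.0, 67.5, 90.0)):
    """One task per rotation angle: five Rotated MNIST tasks."""
    tasks = []
    for angle in angles:
        tf = transforms.Compose([
            transforms.RandomRotation(degrees=(angle, angle)),  # fixed rotation angle
            transforms.ToTensor(),
        ])
        tasks.append(datasets.MNIST(root, train=True, download=True, transform=tf))
    return tasks

def split_cifar100_tasks(root="./data", num_tasks=20, classes_per_task=5, seed=0):
    """Partition CIFAR-100 into 20 disjoint 5-class tasks (classes drawn without replacement)."""
    full = datasets.CIFAR100(root, train=True, download=True,
                             transform=transforms.ToTensor())
    class_order = np.random.RandomState(seed).permutation(100)  # seed is an assumption
    targets = np.asarray(full.targets)
    tasks = []
    for t in range(num_tasks):
        task_classes = class_order[t * classes_per_task:(t + 1) * classes_per_task]
        idx = np.flatnonzero(np.isin(targets, task_classes))
        tasks.append(torch.utils.data.Subset(full, idx))
    return tasks
```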
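The average-accuracy metric in Eq. (3.1) of the "Dataset Splits" row can be computed directly from a matrix of per-task validation accuracies. The sketch below assumes such a matrix is available; the function name is hypothetical.

```python
# Sketch of the average-accuracy metric in Eq. (3.1): A_T is the mean
# validation accuracy over all T tasks after training on task T.
import numpy as np

def average_accuracy(acc_matrix):
    """acc_matrix[t, i] = validation accuracy on task i after training on task t
    (zero-indexed). Returns A_T, computed from the last row."""
    acc_matrix = np.asarray(acc_matrix)
    T = acc_matrix.shape[0]
    return acc_matrix[T - 1, :T].mean()

# Example with T = 3 tasks: only the accuracies after the final task are averaged.
acc = np.array([[0.95, 0.00, 0.00],
                [0.90, 0.93, 0.00],
                [0.88, 0.91, 0.94]])
print(average_accuracy(acc))  # (0.88 + 0.91 + 0.94) / 3 = 0.91
```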
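Finally, a sketch of the hyper-parameter grid quoted in the "Experiment Setup" row, assuming a simple exhaustive search that keeps the configuration with the best average accuracy. The grid values are copied from the table; the train_and_evaluate callable is a hypothetical placeholder, not an interface from the paper.

```python
# Hyper-parameter grid from the table; exhaustive search is an assumed procedure.
from itertools import product

GRID = {
    "learning_rate": [0.001, 0.01, 0.05, 0.1],  # 0.01 reported for MNIST, 0.05 for CIFAR
    "batch_size": [16, 32, 64],                 # 32 reported for CIFAR, 64 for the MLP
    "momentum": [0.0, 0.8],                     # 0.0 for MNIST, 0.8 for CIFAR
    "weight_decay": [0.0, 1e-4],                # 0.0 for MNIST, 1e-4 for CIFAR
}

def grid_search(train_and_evaluate):
    """Run every configuration and return the one with the best average accuracy."""
    best_cfg, best_acc = None, -1.0
    for values in product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        acc = train_and_evaluate(**cfg)  # hypothetical training/evaluation routine
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```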