Overcoming Catastrophic Forgetting for Continual Learning via Model Adaptation

Authors: Wenpeng Hu, Zhou Lin, Bing Liu, Chongyang Tao, Zhengwei Tao, Jinwen Ma, Dongyan Zhao, Rui Yan

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have been carried out to demonstrate the effectiveness of the proposed approach. Experiments on two image datasets (MNIST and CIFAR-10) and two text datasets (DBPedia ontology (Lehmann et al., 2015) and THUCNews (Li et al., 2006)) show that the proposed approach, PGMA, works well across different scenarios and dataset types, and markedly outperforms existing strong baselines.
Researcher Affiliation | Academia | Wenpeng Hu (1,2), Zhou Lin (1), Bing Liu (3), Chongyang Tao (2), Zhengwei Tao (2), Dongyan Zhao (2), Jinwen Ma (1), and Rui Yan (2). (1) Department of Information Science, School of Mathematical Sciences, Peking University; (2) ICST, Peking University, Beijing, China; (3) Department of Computer Science, University of Illinois at Chicago. Emails: {wenpeng.hu,jokerlin,chongyangtao,tttzw,zhaody,ruiyan}@pku.edu.cn, liub@uic.edu, jwma@math.pku.edu.cn
Pseudocode | Yes | Algorithm 1: PGMA (Parameter Generation and Model Adaptation) training
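The paper's core idea, as reflected in this table, is a solver whose parameters combine shared parameters θ0 with task-specific parameters p produced by a dynamic parameter generator (DPG). Below is a toy, runnable sketch of that coupling only; the generator form, names, and dimensions are illustrative assumptions, not the paper's actual Algorithm 1 (see the linked repository for the real implementation).

```python
import numpy as np

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
D, C = 8, 3                          # input dimension / number of classes

theta0 = rng.normal(0, 0.1, (D, C))  # shared solver parameters (theta0)
G = rng.normal(0, 0.1, (D, D * C))   # stand-in for the DPG's weights

def solve(x):
    """Classify x with parameters theta0 + p, where p is generated from x."""
    p = (x @ G).reshape(D, C)        # DPG generates input-specific parameters
    return x @ (theta0 + p)          # solver applies shared + generated params

x = rng.normal(size=(D,))
logits = solve(x)
print(logits.shape)                  # one logit per class: (3,)
```

The point of the sketch is structural: the solver's effective weights are not fixed but are recomposed per input from a shared component and a generated component, which is the mechanism PGMA uses to adapt across tasks.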
Open Source Code | Yes | https://github.com/morning-dews/PGMA (TensorFlow)
Open Datasets | Yes | Two image datasets: (1) MNIST: 70,000 images of handwritten digits from 0 to 9. (2) CIFAR-10: 60,000 32x32 color images in 10 classes, with 6,000 images per class. Two text datasets: (1) DBPedia ontology: a crowd-sourced dataset (Lehmann et al., 2015) with 560,000 training samples and 70,000 test samples. (2) THUCNews: 65,000 sentences in 10 classes (Li et al., 2006).
Dataset Splits | Yes | MNIST: 60,000/3,000/7,000 images for training/validation/testing. CIFAR-10: 50,000/3,000/7,000 images for training/validation/testing. DBPedia ontology: 10,000 samples for validation and 60,000 for testing. THUCNews: 50,000/5,000/10,000 randomly selected sentences for training/validation/testing.
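The quoted image-dataset splits can be checked against the dataset totals given in the Open Datasets row; a quick arithmetic check:

```python
# Train/validation/test split sizes as quoted in the Dataset Splits row.
splits = {
    "MNIST":    (60_000, 3_000, 7_000),   # should total 70,000 images
    "CIFAR-10": (50_000, 3_000, 7_000),   # should total 60,000 images
}

for name, (train, val, test) in splits.items():
    print(name, train + val + test)       # MNIST 70000, CIFAR-10 60000
```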
Hardware Specification | No | The paper does not describe the hardware used for its experiments (e.g., specific GPU or CPU models); it discusses time and memory efficiency only in general terms.
Software Dependencies | No | The paper mentions using the Adam algorithm to update parameters and TensorFlow (implied by the GitHub link), but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | Training details: For fair comparison, the proposed approach uses the same solver (classifier) as the baselines: a multilayer perceptron, i.e., a 3-layer network (two basic units, with each hidden layer as a unit) followed by a softmax layer. The total number of solver parameters includes both the generated parameters p and the shared parameters θ0. A 3-layer perceptron with 2 hidden layers (also called T-net) is used for the DPG, with each hidden layer of size 1000; each T-net generates 100 parameters at a time. Network parameters are updated with the Adam algorithm at a learning rate of 0.001. Appendix B details dataset-specific settings such as hidden layer sizes, dropout rates, and the percentage of parameters replaced.
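The quoted T-net sizes (two hidden layers of 1000 units, 100 generated parameters per forward pass) can be illustrated with a minimal NumPy sketch. The helper names and the 64-dimensional input embedding below are hypothetical; this is not the repository's TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 1000   # size of each T-net hidden layer (from the paper)
CHUNK = 100     # parameters a T-net generates per forward pass (from the paper)
EMB = 64        # assumed size of the T-net's input embedding (illustrative)

def make_mlp(sizes):
    """Build (weight, bias) pairs for a fully connected network."""
    return [(rng.normal(0, 0.05, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, layers):
    """Forward pass: ReLU hidden layers, linear output layer."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = layers[-1]
    return h @ W + b

# One T-net: 2 hidden layers of 1000 units, emitting one 100-parameter chunk.
t_net = make_mlp([EMB, HIDDEN, HIDDEN, CHUNK])

embedding = rng.normal(size=(1, EMB))
generated_chunk = mlp_forward(embedding, t_net)
print(generated_chunk.shape)   # (1, 100): one chunk per pass
```

Since each pass yields only 100 parameters, generating a full parameter vector p would require multiple T-nets (or passes), which is consistent with the paper's per-chunk generation scheme.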