Flattening Sharpness for Dynamic Gradient Projection Memory Benefits Continual Learning
Authors: Danruo Deng, Guangyong Chen, Jianye Hao, Qiong Wang, Pheng-Ann Heng
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As demonstrated empirically, our proposed method consistently outperforms baselines with the superior ability to learn new skills while alleviating forgetting effectively. In this section, we conduct extensive experiments to compare the performance of our proposed FS-DGPM model with the state-of-the-art methods on widely used continual learning benchmark datasets. Additional results and more details about the datasets, experiment setup, baselines, and model architectures are presented in Appendix D and E. |
| Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong, 2Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 3College of Intelligence and Computing, Tianjin University, 4Huawei Noah's Ark Lab |
| Pseudocode | Yes | Algorithm 1 FS-DGPM (Flattening Sharpness for Dynamic Gradient Projection Memory) |
| Open Source Code | Yes | The code is available at: https://github.com/danruod/FS-DGPM |
| Open Datasets | Yes | We evaluate our algorithm on four image classification datasets: Permuted MNIST (PMNIST) [20], CIFAR-100 Split [18], CIFAR-100 Superclass [44] and Tiny Image Net [37]. |
| Dataset Splits | Yes | Early stopping is used to halt the training with up to 10 epochs for each task based on the validation loss as proposed in [35]. The CIFAR-100 Split is constructed by randomly dividing 100 classes of CIFAR-100 into 10 tasks with 10 classes per task. The CIFAR-100 Superclass is divided into 20 tasks according to the 20 superclasses of the CIFAR-100 dataset, and each superclass contains 5 different but semantically related classes. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions that 'All baselines and our method use stochastic gradient descent (SGD) for training,' but it does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | For each task in PMNIST and Tiny Image Net, we train the network in 1 and 10 epochs, respectively, with the batch size as 10. These experimental settings are the same as La-MAML [13], so that we directly compare with their reported results. In the CIFAR-100 Split and CIFAR-100 Superclass experiments, we use the early termination strategy to train up to 50 epochs for each task, which is based on the validation loss as proposed in [35]. For both datasets, the batch size is set to 64. The replay buffer size of PMNIST, CIFAR-100 Split, CIFAR-100 Superclass, and Tiny Image Net are 200, 1000, 1000, and 400, respectively. Details about the experimental setting and the hyperparameters considered for each baseline are provided in Appendix D.5 and D.6. |
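
The Pseudocode row above names Algorithm 1, which combines the two ingredients in the paper's title: flattening sharpness and dynamic gradient projection memory. The sketch below is a generic, hedged illustration of those two ingredients only, not a reproduction of the paper's Algorithm 1; the perturbation radius `rho` and the memory basis `M` are assumptions introduced here for illustration.

```python
# Generic illustration (NOT the paper's Algorithm 1):
#  - a SAM-style worst-case weight perturbation as one way to "flatten sharpness"
#  - projecting the gradient onto the orthogonal complement of a memory subspace,
#    in the spirit of gradient projection memory
import torch

def project_out_memory(grad, M):
    """Remove the component of grad lying in span(M).

    grad: flattened gradient, shape (d,)
    M:    matrix with orthonormal columns, shape (d, k), built from past-task bases
    """
    return grad - M @ (M.t() @ grad)

def sharpness_aware_grad(model, loss_fn, x, y, rho=0.05):
    """Gradient evaluated at a worst-case perturbed point w + eps (SAM-style)."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    # perturb the weights, recompute the gradient, then undo the perturbation
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    perturbed_loss = loss_fn(model(x), y)
    perturbed_grads = torch.autograd.grad(perturbed_loss, list(model.parameters()))
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    return perturbed_grads
```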
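
The Dataset Splits row reports that CIFAR-100 Split is built by randomly dividing the 100 classes into 10 tasks of 10 classes each. A minimal sketch of how such a task split could be constructed with torchvision is shown below; the random class-to-task assignment and the `seed` argument are assumptions for illustration, not the authors' exact protocol.

```python
# Minimal sketch (not the authors' code) of a CIFAR-100 Split: shuffle the
# 100 classes and partition them into 10 tasks of 10 classes each.
import numpy as np
from torch.utils.data import Subset
from torchvision import datasets

def make_cifar100_split(root="./data", n_tasks=10, seed=0):
    train = datasets.CIFAR100(root, train=True, download=True)
    rng = np.random.RandomState(seed)
    classes = rng.permutation(100)              # random class order (assumed)
    per_task = 100 // n_tasks                   # 10 classes per task
    targets = np.asarray(train.targets)
    tasks = []
    for t in range(n_tasks):
        task_classes = classes[t * per_task:(t + 1) * per_task]
        idx = np.where(np.isin(targets, task_classes))[0]
        tasks.append(Subset(train, idx.tolist()))   # one Subset per task
    return tasks

tasks = make_cifar100_split()   # len(tasks) == 10, each covering 10 classes
```

The CIFAR-100 Superclass variant would instead group classes by the 20 predefined superclasses (5 semantically related classes per task) rather than by a random permutation.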
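
For convenience, the per-dataset settings quoted in the Experiment Setup row can be collected into a single configuration sketch. It only restates numbers given above (epochs per task, batch size, replay buffer size, SGD); the dictionary keys and field names are illustrative, and anything deferred to Appendix D.5 and D.6 (e.g. learning rates) is deliberately omitted.

```python
# Per-dataset settings restated from the "Experiment Setup" row.
# Field names are illustrative; unreported hyperparameters are omitted.
CONFIGS = {
    "PMNIST":               {"epochs_per_task": 1,  "batch_size": 10, "buffer_size": 200},
    "CIFAR-100 Split":      {"epochs_per_task": 50, "batch_size": 64, "buffer_size": 1000,
                             "early_stopping": True},   # up to 50 epochs, validation-loss based
    "CIFAR-100 Superclass": {"epochs_per_task": 50, "batch_size": 64, "buffer_size": 1000,
                             "early_stopping": True},
    "Tiny ImageNet":        {"epochs_per_task": 10, "batch_size": 10, "buffer_size": 400},
}
OPTIMIZER = "SGD"   # all baselines and FS-DGPM are trained with SGD
```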