Make Continual Learning Stronger via C-Flat
Authors: Ang Bian, Wei Li, Hangjie Yuan, Chengrong Yu, Mang Wang, Zixiang Zhao, Aojun Lu, Pengliang Ji, Tao Feng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance on CIFAR-100, ImageNet-100 and Tiny-ImageNet. ... Table 1 empirically demonstrates the superiority of our method: Makes Continual Learning Stronger. ... As shown in Figure 2, models trained with vanilla-SGD exhibit higher maximal Hessian eigenvalues ... while our method induces a significant drop in Hessian eigenvalues ... leading to flatter minima. ... We perform an ablation study in two cases: (i) the influence of λ and ρ on different CL methods; (ii) the influence of ρ and its scheduler on different optimizers. |
| Researcher Affiliation | Collaboration | Ang Bian¹, Wei Li¹,², Hangjie Yuan³,⁴, Chengrong Yu¹, Mang Wang⁵, Zixiang Zhao⁶, Aojun Lu¹, Pengliang Ji⁷, Tao Feng². ¹Sichuan University; ²Tsinghua University; ³DAMO Academy, Alibaba Group; ⁴Zhejiang University; ⁵ByteDance; ⁶Xi'an Jiaotong University; ⁷Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 C-Flat Optimization ... Algorithm 2 C-Flat for GPM-family at T > 1 |
| Open Source Code | Yes | Code is available at https://github.com/WanNaa/C-Flat. |
| Open Datasets | Yes | Datasets. We evaluate the performance on CIFAR-100, ImageNet-100 and Tiny-ImageNet. In adherence to [66, 67], the random seed for class-order shuffling is fixed at 1993. |
| Dataset Splits | Yes | Subsequently, we follow two typical class splits in CIL: (i) Divide all Yb classes equally into B phases, denoted as B0_Incy; (ii) Treat half of the total classes as the initial phase, followed by an equal division of the remaining classes into incremental phases, denoted as B50_Incy. In both settings, y denotes that the model learns y new classes per task. (A sketch of this split construction follows the table.) |
| Hardware Specification | No | The paper does not specify the types of GPUs, CPUs, or other specific hardware used for running the experiments. It only generally refers to 'compute resources' in the NeurIPS checklist. |
| Software Dependencies | No | The paper mentions 'a vanilla-SGD optimizer [71]' and 'PyHessian [59]' but does not provide specific version numbers for these or other key software dependencies such as Python, PyTorch, or CUDA. (A PyHessian usage sketch follows the table.) |
| Experiment Setup | Yes | For a given dataset, we study all methods using the same network architecture following repo [66, 67], i.e. ResNet-32 for CIFAR and ResNet-18 for ImageNet. If not specified otherwise, the hyper-parameters for all models adhere to the settings in the open-source library [66, 67]. Each task is initialized with the same ρ and η, which drop with iterations according to the scheduler from [70]. To ensure a fair comparison, all models are trained with a vanilla-SGD optimizer [71], and the proposed method is plugged into the SGD. ... λ controls the strength of the C-Flat penalty ... ρ controls the step length of gradient ascent. (An optimizer-wrapper sketch follows the table.) |
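
For reference, the B0_Incy / B50_Incy protocol and the fixed class-order seed (1993) quoted in the Open Datasets and Dataset Splits rows can be illustrated with a short sketch. The helper `make_cil_splits` below is hypothetical; it mirrors the quoted description rather than the authors' code.

```python
import numpy as np

def make_cil_splits(num_classes, init_classes, inc_classes, seed=1993):
    """Sketch of a class-incremental split: shuffle the class order with a fixed
    seed, assign `init_classes` to the first task, then `inc_classes` per task.
    B0_Incy  -> init_classes = inc_classes = y
    B50_Incy -> init_classes = 50, inc_classes = y (for 100-class datasets)."""
    rng = np.random.RandomState(seed)              # seed 1993, as quoted above
    order = rng.permutation(num_classes).tolist()  # shuffled class order
    tasks = [order[:init_classes]]
    for start in range(init_classes, num_classes, inc_classes):
        tasks.append(order[start:start + inc_classes])
    return tasks

# Example: CIFAR-100 under B0_Inc10 -> 10 tasks of 10 classes each.
print([len(t) for t in make_cil_splits(100, 10, 10)])
```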
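
The Experiment Setup row notes that C-Flat is plugged into vanilla SGD, with λ weighting the flatness penalty and ρ setting the gradient-ascent step, decayed by a scheduler. The wrapper below is only a minimal SAM-style sketch of that ascend-then-descend pattern; `FlatnessWrapper`, the linear ρ decay, and the omission of the λ-weighted penalty terms are simplifying assumptions, not the authors' Algorithm 1.

```python
import torch

class FlatnessWrapper(torch.optim.Optimizer):
    """Two-step sharpness-aware wrapper around a base optimizer (e.g. SGD):
    first_step ascends to w + rho * g / ||g||, second_step restores w and
    applies the base update using the gradient from the perturbed point."""

    def __init__(self, params, base_optimizer_cls=torch.optim.SGD, rho=0.1, **kwargs):
        defaults = dict(rho=rho, **kwargs)
        super().__init__(params, defaults)
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)
        self.param_groups = self.base_optimizer.param_groups

    @torch.no_grad()
    def first_step(self, zero_grad=True):
        grad_norm = torch.norm(torch.stack([
            p.grad.norm(p=2) for g in self.param_groups
            for p in g["params"] if p.grad is not None]), p=2)
        for group in self.param_groups:
            scale = group["rho"] / (grad_norm + 1e-12)
            for p in group["params"]:
                if p.grad is None:
                    continue
                e_w = p.grad * scale
                p.add_(e_w)                       # ascend within the rho-ball
                self.state[p]["e_w"] = e_w
        if zero_grad:
            self.zero_grad()

    @torch.no_grad()
    def second_step(self, zero_grad=True):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                p.sub_(self.state[p]["e_w"])      # restore the original weights
        self.base_optimizer.step()                # SGD step with the perturbed gradient
        if zero_grad:
            self.zero_grad()

def rho_schedule(rho_max, rho_min, step, total_steps):
    """Assumed linear decay of rho over iterations; the paper's scheduler
    follows [70] and may differ."""
    return rho_max - (rho_max - rho_min) * step / max(total_steps, 1)

# Usage sketch (two forward/backward passes per batch):
#   opt = FlatnessWrapper(model.parameters(), torch.optim.SGD, rho=0.1, lr=0.1, momentum=0.9)
#   criterion(model(x), y).backward(); opt.first_step()
#   criterion(model(x), y).backward(); opt.second_step()
```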
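
The flat-minima evidence quoted in the Research Type row (maximal Hessian eigenvalues, Figure 2) is attributed to PyHessian [59]. A minimal usage sketch, assuming the open-source pyhessian package and toy stand-ins for the trained model and evaluation batch, could look like the following; it is not the authors' measurement script.

```python
import torch
import torch.nn as nn
from pyhessian import hessian  # PyHessian [59]

# Toy stand-ins; the paper would use the trained CL model (ResNet-32/18) and its loss.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
inputs, targets = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))

# Power-iteration estimate of the top Hessian eigenvalue, a common sharpness proxy.
hessian_comp = hessian(model, criterion, data=(inputs, targets), cuda=False)
top_eigenvalues, _ = hessian_comp.eigenvalues(top_n=1)
print("max Hessian eigenvalue:", top_eigenvalues[0])
```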