Rethinking Momentum Knowledge Distillation in Online Continual Learning
Authors: Nicolas Michel, Maorong Wang, Ling Xiao, Toshihiko Yamasaki
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we analyze the challenges in applying KD to OCL and give empirical justifications. We introduce a direct yet effective methodology for applying Momentum Knowledge Distillation (MKD) to many flagship OCL methods and demonstrate its capabilities to enhance existing approaches. In addition to improving existing state-of-the-art accuracy by more than 10% points on ImageNet100, we shed light on MKD internal mechanics and impacts during training in OCL. |
| Researcher Affiliation | Academia | 1 Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France; 2 The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan. |
| Pseudocode | Yes | Algorithm 1: PyTorch-like pseudo-code of our loss to integrate into other baselines. (An illustrative sketch of a momentum-distillation loss is given after the table.) |
| Open Source Code | Yes | The code is available at https://github.com/Nicolas1203/mkd_ocl. |
| Open Datasets | Yes | We use variations of standard image classification datasets (Krizhevsky, 2009; Le & Yang, 2015; Deng et al., 2009). Specifically, we experimented on CIFAR10, CIFAR100, TinyImageNet, and ImageNet-100. |
| Dataset Splits | No | The paper describes how the datasets are split into tasks but does not provide explicit details on the training, validation, and test splits (e.g., percentages or exact counts for each part, or references to predefined validation splits). |
| Hardware Specification | Yes | For the compared methods, we trained on RTX A5000 and V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch-like (Paszke et al., 2019) pseudo-code' in Algorithm 1, implying the use of PyTorch, but it does not specify a version number for PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | For all baselines, we perform a small hyperparameter search on CIFAR100, M=5k, applying the determined parameters across other configurations... We use the same hyperparameters when incorporating our loss. Throughout the training process, the streaming batch size is set to 10, and data retrieval from memory is capped at 64. Data augmentation includes random flip, grayscale, color jitter, and random crop. ... Table 10. Hyper-parameters tested for every method on CIFAR100, M=5k, 10 tasks. (A sketch of this augmentation and batch configuration follows the table.) |
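
The paper's Algorithm 1 provides PyTorch-like pseudo-code for its loss, and the exact formulation is in the repository linked above. The snippet below is only a minimal sketch of the general momentum-distillation pattern it builds on: an EMA teacher initialized as a copy of the student, and a temperature-scaled KL distillation term added to the usual cross-entropy on stream and replay samples. The function names (`make_ema_teacher`, `mkd_loss`, `train_step`), the momentum value, temperature, and weight `alpha` are illustrative assumptions, not the paper's exact method.

```python
import copy
import torch
import torch.nn.functional as F

def make_ema_teacher(student):
    # The teacher starts as a frozen copy of the student network.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Exponential moving average of the student weights into the teacher.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def mkd_loss(student_logits, teacher_logits, temperature=4.0):
    # Soft-target distillation (temperature-scaled KL) against the EMA teacher.
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def train_step(student, teacher, x, y, optimizer, alpha=1.0):
    # x, y: concatenation of the incoming stream batch and samples replayed from memory.
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = F.cross_entropy(student_logits, y) + alpha * mkd_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()
```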
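
For the experiment setup row, the following sketch wires the reported constants and augmentation list into torchvision transforms, assuming CIFAR-sized (32×32) inputs. Only the operation list and the batch sizes come from the paper; the crop padding, jitter strengths, and grayscale probability are assumed values for illustration.

```python
from torchvision import transforms

# Augmentations named in the setup: random crop, random flip, color jitter, grayscale.
# Parameter values are assumptions; the crop size assumes CIFAR (32x32) images.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
])

STREAM_BATCH_SIZE = 10   # size of each incoming stream batch (from the paper)
MEMORY_BATCH_SIZE = 64   # cap on samples retrieved from replay memory (from the paper)
```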