Rethinking Momentum Knowledge Distillation in Online Continual Learning
Authors: Nicolas Michel, Maorong Wang, Ling Xiao, Toshihiko Yamasaki
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we analyze the challenges in applying KD to OCL and give empirical justifications. We introduce a direct yet effective methodology for applying Momentum Knowledge Distillation (MKD) to many flagship OCL methods and demonstrate its capabilities to enhance existing approaches. In addition to improving existing state-of-the-art accuracy by more than 10% points on ImageNet100, we shed light on MKD internal mechanics and impacts during training in OCL. |
| Researcher Affiliation | Academia | 1 Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France; 2 The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan. |
| Pseudocode | Yes | Algorithm 1: PyTorch-like pseudo-code of our loss to integrate into other baselines. (An illustrative sketch of a momentum-distillation loss is given after the table.) |
| Open Source Code | Yes | The code is available at https://github.com/Nicolas1203/mkd_ocl. |
| Open Datasets | Yes | We use variations of standard image classification datasets (Krizhevsky, 2009; Le & Yang, 2015; Deng et al., 2009). Specifically, we experimented on CIFAR10, CIFAR100, TinyImageNet, and ImageNet-100. |
| Dataset Splits | No | The paper describes how the datasets are split into tasks but does not provide explicit details on the training, validation, and test splits (e.g., percentages or exact counts for each part, or references to predefined validation splits). |
| Hardware Specification | Yes | For the compared methods, we trained on RTX A5000 and V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch-like (Paszke et al., 2019) pseudo-code' in Algorithm 1, implying the use of PyTorch, but it does not specify a version number for PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | For all baselines, we perform a small hyperparameter search on CIFAR100, M=5k, applying the determined parameters across other configurations... We use the same hyperparameters when incorporating our loss. Throughout the training process, the streaming batch size is set to 10, and data retrieval from memory is capped at 64. Data augmentation includes random flip, grayscale, color jitter, and random crop. ... Table 10. Hyper-parameters tested for every method on CIFAR100, M=5k, 10 tasks. (A sketch of this augmentation and batch configuration follows the table.) |
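
The paper's Algorithm 1 provides PyTorch-like pseudo-code for its loss, and the exact formulation is in the repository linked above. The snippet below is only a minimal sketch of the general momentum-distillation pattern it builds on: an EMA teacher initialized as a copy of the student, and a temperature-scaled KL distillation term added to the usual cross-entropy on stream and replay samples. The function names (`make_ema_teacher`, `mkd_loss`, `train_step`), the momentum value, temperature, and weight `alpha` are illustrative assumptions, not the paper's exact method.

```python
import copy
import torch
import torch.nn.functional as F

def make_ema_teacher(student):
    # The teacher starts as a frozen copy of the student network.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Exponential moving average of the student weights into the teacher.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def mkd_loss(student_logits, teacher_logits, temperature=4.0):
    # Soft-target distillation (temperature-scaled KL) against the EMA teacher.
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def train_step(student, teacher, x, y, optimizer, alpha=1.0):
    # x, y: concatenation of the incoming stream batch and samples replayed from memory.
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = F.cross_entropy(student_logits, y) + alpha * mkd_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()
```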
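
For the experiment setup row, the following sketch wires the reported constants and augmentation list into torchvision transforms, assuming CIFAR-sized (32×32) inputs. Only the operation list and the batch sizes come from the paper; the crop padding, jitter strengths, and grayscale probability are assumed values for illustration.

```python
from torchvision import transforms

# Augmentations named in the setup: random crop, random flip, color jitter, grayscale.
# Parameter values are assumptions; the crop size assumes CIFAR (32x32) images.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
])

STREAM_BATCH_SIZE = 10   # size of each incoming stream batch (from the paper)
MEMORY_BATCH_SIZE = 64   # cap on samples retrieved from replay memory (from the paper)
```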