Momentum Adversarial Distillation: Handling Large Distribution Shifts in Data-Free Knowledge Distillation

Authors: Kien Do, Thai Hung Le, Dung Nguyen, Dang Nguyen, Haripriya Harikumar, Truyen Tran, Santu Rana, Svetha Venkatesh

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on six benchmark datasets including big datasets like ImageNet and Places365 demonstrate the superior performance of MAD over competing methods for handling the large distribution shift problem. Our method also compares favorably to existing DFKD methods and even achieves state-of-the-art results in some cases. (Section 5, Experiments)
Researcher Affiliation | Academia | Kien Do, Hung Le, Dung Nguyen, Dang Nguyen, Haripriya Harikumar, Truyen Tran, Santu Rana, Svetha Venkatesh; Applied Artificial Intelligence Institute (A2I2), Deakin University, Australia
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology or links to a code repository.
Open Datasets | Yes | We consider the image classification task and evaluate our proposed method on 3 small image datasets (CIFAR10 [24], CIFAR100 [24], Tiny ImageNet [25]), and 3 large image datasets (ImageNet [9], Places365 [51], Food101 [4]).
Dataset Splits | No | The paper lists standard datasets like CIFAR10, CIFAR100, and ImageNet but does not explicitly specify the train/validation/test splits used for these datasets in the experimental setup.
Hardware Specification | No | The paper does not specify the exact hardware (e.g., specific GPU or CPU models) used for running the experiments.
Software Dependencies | No | The paper mentions using PyTorch and optimizers like SGD and Adam, but it does not specify version numbers for these software components or any other libraries.
Experiment Setup | Yes | If not otherwise specified, we set the momentum α in Eq. 5 to 0.95 and the length of the noise vector to 256. We train the student S using SGD and Adam for the small and large datasets, respectively. We train the generator G using Adam for both the small and large datasets.
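
Since the paper does not release code, the sketch below is only a rough illustration of how the reported settings could be wired up in PyTorch: a momentum (EMA) copy of the generator with α = 0.95 (assuming Eq. 5 is a standard exponential-moving-average update of the generator weights), a 256-dimensional noise vector, SGD for the student, and Adam for the generator. The network architectures, learning rates, and the ema_update helper are hypothetical and not taken from the paper.

```python
import copy
import torch
import torch.nn as nn

ALPHA = 0.95     # momentum alpha reported for Eq. 5
NOISE_DIM = 256  # length of the noise vector reported in the paper

# Placeholder networks; the actual architectures are not given in this row.
generator = nn.Sequential(nn.Linear(NOISE_DIM, 512), nn.ReLU(),
                          nn.Linear(512, 3 * 32 * 32))
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Momentum (EMA) copy of the generator, never updated by backpropagation.
ema_generator = copy.deepcopy(generator)
for p in ema_generator.parameters():
    p.requires_grad_(False)

# Optimizers as reported: SGD for the student on small datasets (Adam on the
# large ones), Adam for the generator. Learning rates here are illustrative.
student_opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
generator_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

@torch.no_grad()
def ema_update(ema_model, model, alpha=ALPHA):
    # theta_ema <- alpha * theta_ema + (1 - alpha) * theta  (assumed form of Eq. 5)
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(alpha).add_(p, alpha=1.0 - alpha)

# After each gradient step on the generator, refresh the momentum copy:
ema_update(ema_generator, generator)
```

In this reading, the EMA generator changes slowly (α = 0.95), which is how MAD is described as smoothing the distribution of synthetic samples the student sees between generator updates.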