Feature-map-level Online Adversarial Knowledge Distillation

Authors: Inseop Chung, Seonguk Park, Jangho Kim, Nojun Kwak

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, to show the adequacy of our method, we first present a comparison experiment with the distance-based method and an ablation study to analyze our method. Then we compare our approach with existing online knowledge distillation methods under different settings. We present the comparison results using the same network architectures in Sec 4.3 and then apply our method to networks with different architectures in Sec 4.4. In Sec 4.5, we also show the results of training more than two networks to demonstrate that our method generalizes well even when the number of networks increases. In most of the experiments, we use the CIFAR-100 (Krizhevsky et al.) dataset. It consists of 50K training images and 10K test images over 100 classes, i.e., 600 images per class. All reported results on CIFAR-100 are the average of 5 runs.
Researcher Affiliation | Academia | Graduate School of Convergence Science and Technology, Seoul National University, Seoul, South Korea.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | In most of the experiments, we use the CIFAR-100 (Krizhevsky et al.) dataset. It consists of 50K training images and 10K test images over 100 classes, i.e., 600 images per class. ... We use ImageNet LSVRC 2015 (Russakovsky et al., 2015), which has 1.2M training images and 50K validation images over 1,000 classes.
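Both datasets are publicly available. As a minimal sketch, assuming a PyTorch/torchvision environment (the paper does not specify its software stack), CIFAR-100 can be obtained as shown below; the augmentation and normalization values are common defaults, not taken from the paper.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Standard CIFAR-100 preprocessing (assumed defaults, not from the paper).
normalize = T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762))
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    normalize,
])
test_tf = T.Compose([T.ToTensor(), normalize])

# CIFAR-100: 50K training / 10K test images over 100 classes.
train_set = torchvision.datasets.CIFAR100("./data", train=True, download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR100("./data", train=False, download=True, transform=test_tf)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=4)

# ImageNet LSVRC 2015 (1.2M train / 50K val images over 1,000 classes) must be
# downloaded separately; torchvision.datasets.ImageNet can then read it from a
# local directory.
```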
Dataset Splits | No | The paper mentions "1.2M training images and 50K validation images over 1,000 classes" for ImageNet, but for CIFAR-100 it only states "50K training images and 10K test images" without explicitly specifying a validation split or its size/percentage.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor speeds, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions optimizers such as SGD and ADAM, but does not list ancillary software with version numbers (e.g., Python 3.8 or PyTorch 1.9).
Experiment Setup | Yes | For the overall learning schedule, we follow the learning schedule of ONE (Lan et al., 2018), i.e., 300 epochs of training, to conduct a fair comparison. For the logit-based loss, the learning rate starts at 0.1 and is multiplied by 0.1 at epochs 150 and 225. We optimize the logit-based loss using SGD with a mini-batch size of 128, momentum 0.9 and weight decay of 1e-4. ... For the feature-map-based loss, the learning rate starts at 2e-5 for both discriminators and feature extractors and is decayed by 0.1 at epochs 75 and 150. The feature-map-based loss is optimized by ADAM (Kingma & Ba, 2014) with the same mini-batch size of 128 and weight decay of 1e-1.
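The quoted schedule translates directly into two optimizers with step decays. Below is a minimal PyTorch sketch of that setup only; the `model` and `discriminator` modules are placeholders (the paper's architectures and losses are not reproduced here, and its code is not available), so this is an illustration of the hyperparameters rather than the authors' implementation.

```python
import torch.nn as nn
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder modules standing in for a peer network and its feature-map
# discriminator (assumed shapes; the real architectures are in the paper).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))

# Logit-based loss: SGD, lr 0.1 multiplied by 0.1 at epochs 150 and 225,
# momentum 0.9, weight decay 1e-4 (mini-batch size 128 is set in the DataLoader).
logit_opt = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
logit_sched = MultiStepLR(logit_opt, milestones=[150, 225], gamma=0.1)

# Feature-map-based loss: ADAM, lr 2e-5 for both feature extractors and
# discriminators, decayed by 0.1 at epochs 75 and 150, weight decay 1e-1.
feat_opt = Adam(
    list(model.parameters()) + list(discriminator.parameters()),
    lr=2e-5,
    weight_decay=1e-1,
)
feat_sched = MultiStepLR(feat_opt, milestones=[75, 150], gamma=0.1)

for epoch in range(300):  # 300 epochs, following the schedule of ONE
    # ... compute logit-based and feature-map-based losses, step optimizers ...
    logit_sched.step()
    feat_sched.step()
```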