Feature-map-level Online Adversarial Knowledge Distillation
Authors: Inseop Chung, Seonguk Park, Jangho Kim, Nojun Kwak
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, to show the adequacy of our method, we first present a comparison experiment with the distance-based method and an ablation study to analyze our method. Then we compare our approach with existing online knowledge distillation methods under different settings. We demonstrate the comparison experiment results of using the same network architectures in Sec 4.3 and then apply our method on networks with different architectures in Sec 4.4. In Sec 4.5, we also show the results of training more than two networks to demonstrate that our method generalizes well even when the number of networks increases. In most of the experiments, we use the CIFAR-100 (Krizhevsky et al.) dataset. It consists of 50K training images and 10K test images over 100 classes, i.e., 600 images per class. All the reported results on CIFAR-100 are the average of 5 runs. |
| Researcher Affiliation | Academia | Graduate School of Convergence Science and Technology, Seoul National University, Seoul, South Korea. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | In most of the experiments, we use the CIFAR-100 (Krizhevsky et al.) dataset. It consists of 50K training images and 10K test images over 100 classes, i.e., 600 images per class. ... We use ImageNet LSVRC 2015 (Russakovsky et al., 2015), which has 1.2M training images and 50K validation images over 1,000 classes. |
| Dataset Splits | No | The paper mentions "1.2M training images and 50K validation images over 1,000 classes" for ImageNet, but for CIFAR-100, it only states "50K training images and 10K test images" without explicitly mentioning a validation split or its size/percentage. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers like SGD and ADAM, but does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | For the overall learning schedule, we follow the learning schedule of ONE (Lan et al., 2018), which is 300 epochs of training, to conduct a fair comparison. For the logit-based loss, the learning rate starts at 0.1 and is multiplied by 0.1 at epochs 150 and 225. We optimize the logit-based loss using SGD with a mini-batch size of 128, momentum 0.9 and a weight decay of 1e-4. ... For the feature-map-based loss, the learning rate starts at 2e-5 for both the discriminators and the feature extractors and is decayed by 0.1 at epochs 75 and 150. The feature-map-based loss is optimized by ADAM (Kingma & Ba, 2014) with the same mini-batch size of 128 and a weight decay of 1e-1. |
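
The Open Datasets row above names CIFAR-100 with its 50K/10K train/test split and a mini-batch size of 128. A minimal sketch of loading those splits, assuming PyTorch/torchvision (the paper names no framework) and a commonly used CIFAR augmentation recipe that the paper does not specify:

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# CIFAR-100 as described in the paper: 50K training / 10K test images over
# 100 classes. The augmentation below is a standard CIFAR recipe and an
# assumption on our part; the paper does not state its preprocessing.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR100("./data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR100("./data", train=False, download=True,
                             transform=transforms.ToTensor())

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)   # batch 128 per paper
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)
```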
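
The Experiment Setup row gives a complete two-optimizer recipe: SGD for the logit-based loss and ADAM for the feature-map-based adversarial loss, each with its own step schedule. The sketch below shows that schedule in PyTorch; the toy `model` and `discriminator` modules are placeholders of our own (the authors release no code), and only the optimizer and scheduler hyperparameters come from the paper:

```python
import torch.nn as nn
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import MultiStepLR

# Toy stand-ins for one co-trained network and its feature-map
# discriminator; their shapes are assumptions for illustration only.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 100))            # CIFAR-100 classifier
discriminator = nn.Sequential(nn.Linear(16, 1))      # feature-map discriminator

# Logit-based loss: SGD, lr 0.1 multiplied by 0.1 at epochs 150 and 225,
# momentum 0.9, weight decay 1e-4 (mini-batch 128, 300 epochs total).
logit_opt = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
logit_sched = MultiStepLR(logit_opt, milestones=[150, 225], gamma=0.1)

# Feature-map-based loss: ADAM, lr 2e-5 for both the feature extractor and
# the discriminator, decayed by 0.1 at epochs 75 and 150, weight decay 1e-1.
feat_params = list(model.parameters()) + list(discriminator.parameters())
feat_opt = Adam(feat_params, lr=2e-5, weight_decay=1e-1)
feat_sched = MultiStepLR(feat_opt, milestones=[75, 150], gamma=0.1)

for epoch in range(300):
    # ... one training epoch: step logit_opt on the logit-based loss and
    #     feat_opt on the feature-map-based adversarial loss ...
    logit_sched.step()
    feat_sched.step()
```

Keeping two independent optimizer/scheduler pairs mirrors the paper's separation of the two losses; how the adversarial loss itself is computed inside the epoch is not reproduced here.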