Isotonic Data Augmentation for Knowledge Distillation

Authors: Wanyun Cui, Sen Yan

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have verified on various datasets and data augmentation techniques that our proposed IDA algorithms effectively increase the accuracy of knowledge distillation by eliminating rank violations. We show the classification accuracies of the standard knowledge distillation and our proposed isotonic data augmentation in Table 1.
Researcher Affiliation | Academia | Shanghai University of Finance and Economics; cui.wanyun@sufe.edu.cn
Pseudocode | Yes | Algorithm 1: Adapted IRT.
Open Source Code | No | The paper does not provide a direct link or an explicit statement about the availability of its source code.
Open Datasets | Yes | Datasets. We use CIFAR-100 [Krizhevsky et al., 2009], which contains 50k training images with 500 images per class and 10k test images. We also use ImageNet, which contains 1.2 million images from 1K classes for training and 50K images for validation.
Dataset Splits | Yes | We use CIFAR-100 [Krizhevsky et al., 2009], which contains 50k training images with 500 images per class and 10k test images. We also use ImageNet, which contains 1.2 million images from 1K classes for training and 50K images for validation.
Hardware Specification | Yes | Models for ImageNet were trained on 4 Nvidia Tesla V100 GPUs. Models for CIFAR-100 were trained on a single Nvidia Tesla V100 GPU.
Software Dependencies | No | The paper mentions using SGD as the optimizer and refers to various models (ResNet, GoogLeNet, BERT, DistilBERT) and data augmentation techniques (Mixup, CutMix), but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch x.x, CUDA x.x).
Experiment Setup | Yes | By default, we set β = 3 and σ = 2, which are derived from a grid search over {0.5, 1, 2, 3, 4, 5}. We set τ = 4.5 and α = 0.95 following common practice. For ImageNet, we train the student model for 100 epochs. We use SGD as the optimizer with an initial learning rate of 0.1. We decay the learning rate by 0.1 at epochs 30, 60, and 90.
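
The Experiment Setup row lists training hyperparameters but, as noted above, no source code is released. Below is a minimal PyTorch sketch of a standard soft-target knowledge-distillation training loop wired with the reported values (τ = 4.5, α = 0.95, SGD with an initial learning rate of 0.1 decayed by 0.1 at epochs 30, 60, and 90, for 100 epochs on ImageNet). It is an illustrative assumption, not the paper's isotonic data augmentation (IDA) algorithm, which additionally adjusts the teacher's soft targets to remove rank violations; momentum, weight decay, and the data pipeline are also unstated assumptions.

import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

TAU, ALPHA = 4.5, 0.95   # temperature and distillation weight reported in the paper
EPOCHS = 100             # ImageNet training length reported in the paper

def kd_loss(student_logits, teacher_logits, labels, tau=TAU, alpha=ALPHA):
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # scaled by tau^2 as in standard knowledge distillation.
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau * tau)
    # Hard-label cross-entropy term.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def train_student(student, teacher, loader, device="cuda"):
    # SGD with the reported initial learning rate; momentum and weight decay
    # are not stated in the paper and are assumptions here.
    optimizer = SGD(student.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    # Step decay by 0.1 at epochs 30, 60, 90 as reported.
    scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
    student.to(device).train()
    teacher.to(device).eval()
    for epoch in range(EPOCHS):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(images)
            loss = kd_loss(student(images), teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

The τ²-scaled KL term follows the usual Hinton-style formulation so that gradient magnitudes remain comparable across temperatures; the paper's IDA step would sit between the teacher forward pass and the loss computation.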