Isotonic Data Augmentation for Knowledge Distillation
Authors: Wanyun Cui, Sen Yan
IJCAI 2021 | Conference PDF | Archive PDF
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have verified on various datasets and data augmentation techniques that our proposed IDA algorithm effectively increases the accuracy of knowledge distillation by eliminating the rank violations. We show the classification accuracies of standard knowledge distillation and our proposed isotonic data augmentation in Table 1. |
| Researcher Affiliation | Academia | Shanghai University of Finance and Economics; cui.wanyun@sufe.edu.cn |
| Pseudocode | Yes | Algorithm 1 Adapted IRT (isotonic regression; see the sketch after this table). |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of its source code. |
| Open Datasets | Yes | Datasets. We use CIFAR-100 [Krizhevsky et al., 2009], which contains 50k training images with 500 images per class and 10k test images. We also use ImageNet, which contains 1.2 million images from 1K classes for training and 50K images for validation. |
| Dataset Splits | Yes | We use CIFAR-100 [Krizhevsky et al., 2009], which contains 50k training images with 500 images per class and 10k test images. We also use ImageNet, which contains 1.2 million images from 1K classes for training and 50K images for validation. |
| Hardware Specification | Yes | Models for ImageNet were trained on 4 Nvidia Tesla V100 GPUs. Models for CIFAR-100 were trained on a single Nvidia Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions using SGD as the optimizer and refers to various models (ResNet, GoogLeNet, BERT, DistilBERT) and data augmentation techniques (Mixup, CutMix), but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch x.x, CUDA x.x). |
| Experiment Setup | Yes | By default, we set β = 3, σ = 2, which are derived from grid search in {0.5, 1, 2, 3, 4, 5}. We set τ = 4.5, α = 0.95 from common practice. For ImageNet, we train the student model for 100 epochs. We use SGD as the optimizer with an initial learning rate of 0.1. We decay the learning rate by 0.1 at epochs 30, 60, and 90. (See the training-setup sketch after this table.) |
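
The pseudocode row points to Algorithm 1, "Adapted IRT", which applies isotonic regression to eliminate rank violations in the augmented teacher labels. Since the source code is not released, the sketch below shows only the classical pool-adjacent-violators algorithm (PAVA) that such a procedure builds on, not the paper's adapted variant for partial orders; the NumPy implementation and the function name `isotonic_regression_pava` are illustrative assumptions.

```python
import numpy as np

def isotonic_regression_pava(y, weights=None):
    """Classical pool-adjacent-violators algorithm (PAVA).

    Fits a non-decreasing sequence x minimizing the weighted squared error
    sum_i w_i * (x_i - y_i)^2. The paper's Algorithm 1 ("Adapted IRT")
    modifies isotonic regression for the partial order induced by
    augmented soft labels; this sketch covers only the standard case.
    """
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)

    # Each block stores [weighted mean, total weight, block length].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            merged_w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / merged_w, merged_w, n1 + n2])

    # Expand block means back to a sequence of the original length.
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

# Toy example: rank violations in a "teacher logit" sequence are pooled away.
print(isotonic_regression_pava([3.0, 1.0, 2.0, 5.0, 4.0]))
# approximately [2.0, 2.0, 2.0, 4.5, 4.5]
```

PAVA runs in linear time overall, since each element is merged into a block at most once.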
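
The experiment-setup row reports τ = 4.5, α = 0.95, SGD with an initial learning rate of 0.1, and decay by 0.1 at epochs 30, 60, and 90 over 100 epochs. The following is a minimal sketch of that schedule combined with a standard Hinton-style distillation loss, assuming a PyTorch implementation (the paper does not name its framework); the momentum and weight-decay values and all identifiers are illustrative, and the paper's method additionally replaces the augmented teacher distribution with its isotonic-regression projection before computing this loss.

```python
import torch
import torch.nn.functional as F

TAU, ALPHA = 4.5, 0.95  # temperature and soft/hard mixing weight from the paper

def kd_loss(student_logits, teacher_logits, targets, tau=TAU, alpha=ALPHA):
    """Standard temperature-scaled knowledge-distillation loss (sketch only)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau  # rescale the soft term to balance it against the hard-label term
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

# Reported optimizer and schedule (momentum/weight decay are assumptions):
# student = ...  # e.g. a ResNet student
# optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
#                             momentum=0.9, weight_decay=1e-4)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(
#     optimizer, milestones=[30, 60, 90], gamma=0.1)
# for epoch in range(100):
#     ...  # one training epoch over ImageNet, then:
#     scheduler.step()
```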