What Makes a "Good" Data Augmentation in Knowledge Distillation - A Statistical Perspective

Authors: Huan Wang, Suhas Lohit, Michael N. Jones, Yun Fu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical studies support our claims and demonstrate how we can harvest considerable performance gains simply by using a better DA scheme in knowledge distillation. Presenting such a theoretically sound metric and empirically validating its effectiveness is the goal of this paper.
Researcher Affiliation | Collaboration | 1) Northeastern University, Boston, MA; 2) MERL, Cambridge, MA. This paper originates from Huan's summer internship work at MERL.
Pseudocode | No | The paper describes its methods and proposed schemes in prose (e.g., "Concretely, given a batch of data, we first apply CutMix...", "The idea is partly inspired by active learning..."), but does not include any formally structured pseudocode or algorithm blocks. A hedged sketch of such a CutMix step is given after the table.
Open Source Code | Yes | Project: http://huanwang.tech/Good-DA-in-KD. We include the code link.
Open Datasets | Yes | We evaluate our method primarily on the CIFAR100 [21] and Tiny ImageNet* datasets. CIFAR100 has 100 object classes (32×32 RGB images). Each class has 500 images for training and 100 images for testing. Tiny ImageNet is a small version of ImageNet [10] with 200 classes (64×64 RGB images). Each class has 500 images for training, 50 for validation, and 50 for testing. *https://tiny-imagenet.herokuapp.com/
Dataset Splits | Yes | CIFAR100 has 100 object classes (32×32 RGB images). Each class has 500 images for training and 100 images for testing. Tiny ImageNet is a small version of ImageNet [10] with 200 classes (64×64 RGB images). Each class has 500 images for training, 50 for validation, and 50 for testing.
Hardware Specification | No | The paper states: "We use PyTorch [31] to conduct all our experiments." However, it does not specify any details about the hardware (e.g., GPU model, CPU type) used for these experiments.
Software Dependencies | No | The paper mentions "We use PyTorch [31] to conduct all our experiments." but does not specify version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | The temperature τ of knowledge distillation is set to 4, following CRD [43]. The loss weight α = 0.9 (Eq. (1)). For CIFAR100 and Tiny ImageNet, the training batch size is 64; the original number of total training epochs is 240, with the learning rate (LR) decayed at epochs 150, 180, and 210 by a multiplier of 0.1. The initial LR is 0.05. For prolonged training, we train for 480 epochs instead of 960 to save time. A hedged sketch of this training setup is given below the table.
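
The settings quoted in the Experiment Setup row map onto a standard distillation training loop. The following is a minimal sketch under stated assumptions, not the authors' released implementation: the α = 0.9 weight is applied to the soft (KL) term and the hard cross-entropy term gets 1 − α, the KL term is scaled by τ² (the common Hinton/CRD convention), and the optimizer is SGD; the momentum and weight-decay values are placeholders, since the quoted excerpt does not state them.

```python
# Minimal sketch of the quoted KD training setup (not the authors' released code).
# Assumptions: alpha weights the KD term, the KD term is scaled by tau^2, and the
# optimizer is SGD; the paper excerpt only states tau=4, alpha=0.9, initial LR=0.05,
# batch size 64, 240 epochs, and LR decay by 0.1 at epochs 150/180/210.
import torch
import torch.nn.functional as F

TAU, ALPHA = 4.0, 0.9
INIT_LR, BATCH_SIZE, EPOCHS = 0.05, 64, 240

def kd_loss(student_logits, teacher_logits, labels, tau=TAU, alpha=ALPHA):
    """Distillation loss: temperature-softened KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def make_optimizer_and_scheduler(student):
    # Momentum and weight decay are placeholder values (not given in the excerpt).
    opt = torch.optim.SGD(student.parameters(), lr=INIT_LR,
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[150, 180, 210], gamma=0.1)
    return opt, sched
```

Where these conventions differ from the code released at http://huanwang.tech/Good-DA-in-KD, the released code takes precedence.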
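
For the Pseudocode row: the paper describes its augmentation scheme only in prose ("given a batch of data, we first apply CutMix..."). The sketch below shows a generic CutMix step (Yun et al., 2019) applied to a batch, purely as an illustration of that prose; the Beta(1, 1) mixing distribution and random in-batch pairing are assumptions, and the authors' actual scheme (including any sample-picking step) may differ.

```python
# Generic CutMix on a batch: a sketch of the augmentation named in the paper's prose,
# not the authors' implementation. Beta(1, 1) and random in-batch pairing are assumptions.
import numpy as np
import torch

def cutmix_batch(images, alpha=1.0):
    """Paste a random rectangle from a shuffled copy of the batch into each image.

    Returns the mixed images, the permutation used for pairing, and the effective
    mix ratio lam (fraction of each original image that is kept).
    """
    b, _, h, w = images.shape
    perm = torch.randperm(b)
    lam = np.random.beta(alpha, alpha)

    # Sample a box whose area is roughly (1 - lam) of the image, at a random center.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)  # adjust for clipping at the border
    return mixed, perm, lam
```

In a distillation setting, the mixed batch would be fed to both teacher and student so the student is trained to match the teacher's outputs on the augmented inputs; perm and lam are only needed if a hard-label term mixes labels as lam * loss(y) + (1 - lam) * loss(y[perm]).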