Knowledge Distillation Based on Transformed Teacher Matching

Authors: Kaixiang Zheng, En-Hui Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiment results demonstrate that thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance student's capability to match teacher's power transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). It is shown, by comprehensive experiments, that although WTTM is simple, it is effective, improves upon TTM, and achieves state-of-the-art accuracy performance." (See the power-transform note after this table.)
Researcher Affiliation | Academia | "Kaixiang Zheng & En-Hui Yang, Department of Electrical and Computer Engineering, University of Waterloo, {k56zheng,ehyang}@uwaterloo.ca"
Pseudocode | Yes | "In this section, we provide the pseudo-code for TTM and WTTM in a PyTorch-like style, shown in Algorithm 1. It's clear that both TTM and WTTM are quite easy to implement." (See the loss sketch after this table.)
Open Source Code | Yes | "Our source code is available at https://github.com/zkxufo/TTM."
Open Datasets | Yes | "We benchmark TTM and WTTM on two prevailing image classification datasets, namely CIFAR-100 and ImageNet (Deng et al., 2009)."
Dataset Splits | Yes | "CIFAR-100 contains 60k 32×32 color images of 100 classes, with 600 images per class, and it's further split into 50k training images and 10k test images." (See the data-loading sketch after this table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup.
Software Dependencies | No | The paper mentions the torchdistill (Matsubara, 2021) library and PyTorch (Paszke et al., 2019) but does not specify version numbers for them or for other software dependencies.
Experiment Setup | Yes | "Note that we list T and β values of all experiments in A.4 for reproducibility."
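The "Research Type" row quotes the abstract's notion of matching the teacher's power transformed probability distribution. As a reading aid only (an assumption based on the abstract's wording, not a formula copied from the paper), such a transform of a teacher distribution p with exponent γ can be written as

```latex
\tilde{p}_k = \frac{p_k^{\,\gamma}}{\sum_{j} p_j^{\,\gamma}}, \qquad 0 < \gamma \le 1,
```

with TTM training the student to match \tilde{p} and WTTM additionally weighting each sample's matching term with a sample-adaptive coefficient.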
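The "Pseudocode" row reports that the paper's Algorithm 1 gives PyTorch-style pseudo-code for TTM and WTTM. Below is a minimal sketch consistent with that description, not a reproduction of Algorithm 1 (the reference implementation is at https://github.com/zkxufo/TTM); the function name `ttm_loss`, the default temperature, and the optional `sample_weight` argument are illustrative assumptions, and the exact sample-adaptive weight used by WTTM (controlled by β in the paper) is not reproduced here.

```python
# Minimal sketch of a TTM-style distillation term (assumption: the teacher's
# distribution is power transformed via a teacher-only temperature and then
# matched by the student's untempered softmax; WTTM rescales each sample's term).
import torch
import torch.nn.functional as F


def ttm_loss(student_logits, teacher_logits, T=4.0, sample_weight=None):
    """Cross-entropy between the transformed teacher and the student,
    equal to the KL divergence up to a teacher-only entropy constant."""
    # Power-transformed teacher: softmax(teacher_logits / T) is the teacher's
    # probability vector raised to the power 1/T and renormalized.
    teacher_prob = F.softmax(teacher_logits / T, dim=1)
    # Student keeps temperature 1 (the asymmetry suggested by the abstract).
    student_log_prob = F.log_softmax(student_logits, dim=1)
    per_sample = -(teacher_prob * student_log_prob).sum(dim=1)
    if sample_weight is not None:
        # WTTM-style sample-adaptive weighting (weights supplied by the caller).
        per_sample = sample_weight * per_sample
    return per_sample.mean()


# Toy usage: a batch of 4 samples with 100 classes (CIFAR-100-sized output).
student = torch.randn(4, 100)
teacher = torch.randn(4, 100)
print(ttm_loss(student, teacher).item())
```

In a full training recipe this term would typically be combined with the usual cross-entropy loss on ground-truth labels; the T and β settings actually used in the paper's experiments are listed in its A.4.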
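The "Open Datasets" and "Dataset Splits" rows quote the standard CIFAR-100 statistics (50k training / 10k test images of size 32×32 over 100 classes). The snippet below is a minimal sketch of obtaining that split with torchvision; the paper's actual preprocessing and augmentation pipeline is not specified here.

```python
# Fetch the standard CIFAR-100 split (50,000 train / 10,000 test images).
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # placeholder; the paper's augmentations may differ
train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR100(root="./data", train=False, download=True, transform=transform)
print(len(train_set), len(test_set))  # 50000 10000
```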