Revisit the Essence of Distilling Knowledge through Calibration

Authors: Wen-Shu Fan, Su Lu, Xin-Chun Li, De-Chuan Zhan, Le Gan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive analytical experiments, we observe a positive correlation between the calibration of the teacher model and the KD performance with original KD methods." "We employ CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009) and Tiny-Imagenet (Tavanaei, 2020) datasets for our experiments." (A calibration-measurement sketch follows the table.)
Researcher Affiliation | Academia | Wen-Shu Fan¹², Su Lu¹², Xin-Chun Li¹², De-Chuan Zhan¹², Le Gan¹². ¹School of Artificial Intelligence, Nanjing University, China; ²National Key Laboratory for Novel Software Technology, Nanjing University, China. Correspondence to: De-Chuan Zhan <zhandc@nju.edu.cn>.
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code | No | No explicit statement about the release of open-source code or a direct link to a code repository for the methodology described in this paper was found.
Open Datasets | Yes | "We employ CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009) and Tiny-Imagenet (Tavanaei, 2020) datasets for our experiments." (A data-loading sketch follows the table.)
Dataset Splits | No | The paper describes using test datasets (Table 1) and training for a set number of epochs, but does not explicitly state train/validation/test splits, percentages, or absolute counts for a validation set; it mostly refers to 'test samples' and the 'test dataset' for evaluation.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory) used to run the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions models and methods such as ResNet, Wide ResNet, MobileNet V2, KL divergence, t-SNE, and Spearman and Pearson correlation, but does not give version numbers for any software, libraries, or frameworks.
Experiment Setup | Yes | "Each training session consists of 200 epochs, and the learning rate is 0.03." "The CE loss in Equation (2) is excluded by setting the value of α to 1 in our experiments." "We use a temperature parameter τ in the softmax function to calculate p_t(τ) and p_s(τ)." "We fix the distillation temperature to 1.8 on the CIFAR-100 dataset." "We opt to train the teacher model for only 100 epochs, initiating the learning rate decay after the 50th epoch." (A training-objective sketch follows the table.)
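
The Research Type row refers to a correlation between teacher calibration and KD performance. Below is a minimal sketch, not the authors' released code, of how such an analysis could be run: compute each teacher's Expected Calibration Error (ECE) and rank-correlate it with the distilled student's accuracy. The function names, bin count, and use of scipy are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE: bin samples by predicted confidence and average the
    |accuracy - confidence| gap per bin, weighted by bin frequency."""
    confidences = np.asarray(confidences, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_acc = (predictions[in_bin] == labels[in_bin]).mean()
        bin_conf = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(bin_acc - bin_conf)
    return ece

# Hypothetical usage: one ECE value per teacher, one accuracy per distilled student.
# teacher_eces = [expected_calibration_error(c, p, y) for (c, p, y) in teacher_outputs]
# rho, p_value = spearmanr(teacher_eces, student_accuracies)
```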
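
The Open Datasets row cites CIFAR-10/CIFAR-100 and Tiny-ImageNet. A hedged data-loading sketch for CIFAR-100 with torchvision is shown below; the augmentations and normalization statistics are common defaults, not values reported in the paper, and Tiny-ImageNet has no built-in torchvision loader.

```python
import torchvision
import torchvision.transforms as T

# Commonly used CIFAR-100 channel statistics (assumption; not stated in the paper).
CIFAR100_MEAN = (0.5071, 0.4865, 0.4409)
CIFAR100_STD = (0.2673, 0.2564, 0.2762)

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(CIFAR100_MEAN, CIFAR100_STD),
])
test_tf = T.Compose([T.ToTensor(), T.Normalize(CIFAR100_MEAN, CIFAR100_STD)])

train_set = torchvision.datasets.CIFAR100("./data", train=True, download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR100("./data", train=False, download=True, transform=test_tf)
# Tiny-ImageNet is not bundled with torchvision; torchvision.datasets.ImageFolder
# over a locally downloaded copy is one common route.
```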
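
The Experiment Setup row quotes the pieces of a training recipe: 200 epochs, learning rate 0.03, α = 1 so the CE term vanishes, and a distillation temperature of 1.8 on CIFAR-100. The sketch below is a minimal Hinton-style KD objective consistent with those quotes; the exact form of the paper's Equation (2), and the reading that α weights the KD term, are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=1.8, alpha=1.0):
    """Assumed objective: alpha * tau^2 * KL(p_t(tau) || p_s(tau)) + (1 - alpha) * CE.
    With alpha = 1.0 the cross-entropy term drops out, matching the quoted setup."""
    log_p_s = F.log_softmax(student_logits / tau, dim=1)   # p_s(tau), in log space
    p_t = F.softmax(teacher_logits / tau, dim=1)           # p_t(tau)
    kd = F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Hypothetical optimizer matching the quoted hyperparameters (momentum and weight
# decay are assumptions; the paper quote only gives the learning rate and epoch count):
# optimizer = torch.optim.SGD(student.parameters(), lr=0.03, momentum=0.9, weight_decay=5e-4)
# for epoch in range(200): ...
```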