Revisit the Essence of Distilling Knowledge through Calibration
Authors: Wen-Shu Fan, Su Lu, Xin-Chun Li, De-Chuan Zhan, Le Gan
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive analytical experiments, we observe a positive correlation between the calibration of the teacher model and the KD performance with original KD methods. We employ CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009) and Tiny-Imagenet (Tavanaei, 2020) datasets for our experiments. [See the calibration-correlation sketch after this table.] |
| Researcher Affiliation | Academia | Wen-Shu Fan, Su Lu, Xin-Chun Li, De-Chuan Zhan, Le Gan (all affiliated with the School of Artificial Intelligence, Nanjing University, China, and the National Key Laboratory for Novel Software Technology, Nanjing University, China). Correspondence to: De-Chuan Zhan <zhandc@nju.edu.cn>. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | No | No explicit statement about the release of open-source code or a direct link to a code repository for the methodology described in this paper was found. |
| Open Datasets | Yes | We employ CIFAR-10/CIFAR-100 (Krizhevsky et al., 2009) and Tiny-Imagenet (Tavanaei, 2020) datasets for our experiments. |
| Dataset Splits | No | The paper describes using test datasets (Table 1) and training for a certain number of epochs, but does not explicitly state training/validation/test dataset splits, specific percentages, or absolute counts for a validation set. It mostly refers to 'test samples' and 'test dataset' for evaluation. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory specifications) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions various models and metrics like ResNet, Wide ResNet, MobileNet V2, KL divergence, t-SNE, Spearman correlation, and Pearson correlation, but does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | Each training session consists of 200 epochs, and the learning rate is 0.03. The CE loss in Equation (2) is excluded by setting the value of α to 1 in our experiments. We use a temperature parameter τ in the softmax function for both the teacher and the student (indices {t, s}) to calculate p_t(τ) and p_s(τ). We fix the distillation temperature to 1.8 on the CIFAR-100 dataset. We opt to train the teacher model for only 100 epochs, initiating the learning rate decay after the 50th epoch. [A sketch of this distillation objective follows the table.] |
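
The quoted setup matches the standard temperature-scaled knowledge distillation objective, L = α · τ² · KL(p_t(τ) ‖ p_s(τ)) + (1 − α) · CE, where setting α = 1 drops the CE term. Below is a minimal PyTorch sketch under that reading; the function name, signature, and defaults (τ = 1.8, α = 1) are assumptions drawn from the quoted text, not code released by the authors.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=1.8, alpha=1.0):
    """Standard KD objective: alpha * tau^2 * KL(p_t(tau) || p_s(tau)) + (1 - alpha) * CE.

    With alpha = 1.0, as in the quoted setup, the CE term is excluded.
    """
    # Temperature-scaled distributions for the student (log-probs) and teacher (probs).
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)

    # KL(p_t || p_s), averaged over the batch and rescaled by tau^2.
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

    # Supervised cross-entropy on the unscaled student logits.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kl + (1.0 - alpha) * ce
```

The τ² factor follows the original Hinton et al. formulation, keeping the gradient magnitude of the KL term roughly comparable across temperatures.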
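
The correlation noted in the Research Type row can be probed with a calibration metric and a rank correlation. The sketch below assumes Expected Calibration Error (ECE) as the calibration measure and uses SciPy's `spearmanr`; the paper mentions Spearman and Pearson correlation, but the quoted text does not name its exact calibration estimator, and the arrays at the end are placeholder values for illustration only, not results from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def expected_calibration_error(probs, labels, n_bins=15):
    """Bin predictions by confidence and average |accuracy - confidence|, weighted by bin size."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Placeholder inputs (not from the paper): one (teacher ECE, student accuracy) pair per teacher.
teacher_ece = np.array([0.02, 0.03, 0.05, 0.08])
student_acc = np.array([74.0, 73.2, 72.5, 71.1])
rho, p_value = spearmanr(teacher_ece, student_acc)
print(f"Spearman correlation between teacher ECE and KD accuracy: {rho:.3f}")
```

Since lower ECE means better calibration, a positive relationship between teacher calibration and KD performance would show up here as a negative rank correlation between teacher ECE and student accuracy.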