Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation

Authors: Jiaming Lv, Haoyuan Yang, Peihua Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations on image classification and object detection have shown that (1) for logit distillation, WKD-L outperforms very strong KL-Div variants; (2) for feature distillation, WKD-F is superior to the KL-Div counterparts and state-of-the-art competitors. The source code is available at http://peihuali.org/WKD. We evaluate WKD for image classification on ImageNet [41] and CIFAR-100 [42]. Also, we evaluate the effectiveness of WKD on self-knowledge distillation (Self-KD). Further, we extend WKD to object detection and conduct experiments on MS-COCO [43].
Researcher Affiliation | Academia | Jiaming Lv, Haoyuan Yang, Peihua Li; Dalian University of Technology; ljm_vlg@mail.dlut.edu.cn, yanghaoyuan@mail.dlut.edu.cn, peihuali@dlut.edu.cn
Pseudocode | No | The paper describes its methods using mathematical formulations and textual explanations, but it does not contain formally labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code is available at http://peihuali.org/WKD.
Open Datasets | Yes | We evaluate WKD for image classification on ImageNet [41] and CIFAR-100 [42]. ... Further, we extend WKD to object detection and conduct experiments on MS-COCO [43].
Dataset Splits | Yes | ImageNet [41] contains 1,000 categories with 1.28M images for training, 50K images for validation and 100K for testing.
Hardware Specification | Yes | We train and test models with the PyTorch framework [44], using a PC with an Intel Core i9-13900K CPU and GeForce RTX 4090 GPUs.
Software Dependencies | No | The paper mentions using the 'PyTorch framework [44]' and 'POT library [45]' but does not provide specific version numbers for these or any other software dependencies needed for reproducibility.
Experiment Setup | Yes | In accordance with [25], we train the models for 100 epochs using the SGD optimizer with a batch size of 256, a momentum of 0.9 and a weight decay of 1e-4. The initial learning rate is 0.1, divided by 10 at the 30th, 60th and 90th epochs, respectively. ... For WKD-L, we use the POT library [45] for solving discrete WD with η=0.05 and 9 iterations. For WKD-F, the projector has a bottleneck structure, i.e., a 1×1 Convolution (Conv) and a 3×3 Conv both with 256 filters, followed by a 1×1 Conv with BN and ReLU to match the size of the teacher's feature maps.
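
To make the quoted experiment setup concrete, the sketch below reconstructs the training recipe, the POT-based discrete WD computation for WKD-L, and the WKD-F projector in PyTorch. This is a minimal, hedged sketch rather than the authors' released code: the mapping of η=0.05 to POT's reg parameter and of the 9 iterations to numItermax, the category-to-category cost matrix M, and the exact placement of BN/ReLU inside the projector are assumptions.

import torch
import torch.nn as nn
import ot  # POT library for optimal transport


def build_optimizer(model: nn.Module):
    # Training recipe quoted in the paper: SGD, momentum 0.9, weight decay
    # 1e-4, initial LR 0.1 divided by 10 at epochs 30, 60 and 90
    # (100 epochs, batch size 256).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1)
    return optimizer, scheduler


def wkd_l_distance(p_teacher, p_student, M, eta=0.05, n_iter=9):
    # Discrete Wasserstein distance between teacher and student category
    # distributions, solved with POT's Sinkhorn routine.
    # Assumptions: eta is passed as POT's entropic regularization `reg`,
    # the 9 iterations as `numItermax`, and M is a precomputed
    # category-to-category cost matrix (its construction is not shown here).
    plan = ot.sinkhorn(p_teacher, p_student, M, reg=eta, numItermax=n_iter)
    return (plan * M).sum()


class Projector(nn.Module):
    # WKD-F bottleneck projector: a 1x1 Conv and a 3x3 Conv, both with 256
    # filters, followed by a 1x1 Conv with BN and ReLU matching the teacher's
    # channel count. Whether BN/ReLU also follow the first two convs is not
    # stated in the quote; this sketch places them only after the last conv.
    def __init__(self, in_channels, teacher_channels, mid_channels=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.Conv2d(mid_channels, teacher_channels, kernel_size=1),
            nn.BatchNorm2d(teacher_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.layers(x)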