Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation
Authors: Jiaming Lv, Haoyuan Yang, Peihua Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations on image classification and object detection have shown (1) for logit distillation WKD-L outperforms very strong KL-Div variants; (2) for feature distillation WKD-F is superior to the KL-Div counterparts and state-of-the-art competitors. The source code is available at http://peihuali.org/WKD. ... We evaluate WKD for image classification on ImageNet [41] and CIFAR-100 [42]. Also, we evaluate the effectiveness of WKD on self-knowledge distillation (Self-KD). Further, we extend WKD to object detection and conduct experiment on MS-COCO [43]. |
| Researcher Affiliation | Academia | Jiaming Lv, Haoyuan Yang, Peihua Li; Dalian University of Technology; ljm_vlg@mail.dlut.edu.cn, yanghaoyuan@mail.dlut.edu.cn, peihuali@dlut.edu.cn |
| Pseudocode | No | The paper describes its methods using mathematical formulations and textual explanations, but it does not contain formally labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at http://peihuali.org/WKD. |
| Open Datasets | Yes | We evaluate WKD for image classification on ImageNet [41] and CIFAR-100 [42]. ... Further, we extend WKD to object detection and conduct experiment on MS-COCO [43]. |
| Dataset Splits | Yes | ImageNet [41] contains 1,000 categories with 1.28M images for training, 50K images for validation and 100K for testing. |
| Hardware Specification | Yes | We train and test models with PyTorch framework [44], using a PC with an Intel Core i9-13900K CPU and GeForce RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions using 'PyTorch framework [44]' and 'POT library [45]' but does not provide specific version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | In accordance with [25], we train the models for 100 epochs using SGD optimizer with a batch size of 256, a momentum of 0.9 and a weight decay of 1e-4. The initial learning rate is 0.1, divided by 10 at the 30th, 60th and 90th epochs, respectively. ... For WKD-L, we use POT library [45] for solving discrete WD with η=0.05 and 9 iterations. For WKD-F, the projector has a bottleneck structure, i.e., a 1 × 1 Convolution (Conv) and a 3 × 3 Conv both with 256 filters followed by a 1 × 1 Conv with BN and ReLU to match the size of teacher's feature maps. (Illustrative sketches of this training recipe, the POT-based WD solver, and the WKD-F projector are given after the table.) |
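
The reported training recipe maps directly onto standard PyTorch components. A minimal sketch, assuming nothing beyond the quoted hyperparameters (the function name and structure are illustrative, not from the authors' code):

```python
import torch

def build_imagenet_training_setup(model):
    """Sketch of the reported ImageNet recipe: SGD with momentum 0.9 and
    weight decay 1e-4, initial learning rate 0.1 divided by 10 at epochs
    30, 60 and 90, trained for 100 epochs with batch size 256."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1)
    return optimizer, scheduler
```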
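For WKD-L, the table reports that the discrete Wasserstein distance is solved with the POT library using η=0.05 and 9 iterations. The sketch below shows how such a loss could be computed with POT's Sinkhorn solver; the cost matrix construction, the temperature value, and the per-sample loop are assumptions, not the authors' implementation:

```python
import torch
import ot  # POT library [45]

def wkd_l_style_loss(student_logits, teacher_logits, cost_matrix,
                     eta=0.05, num_iters=9, temperature=4.0):
    """Entropic-regularized discrete Wasserstein distance between teacher
    and student category distributions, solved with POT's Sinkhorn routine.
    `cost_matrix` (num_classes x num_classes, torch tensor) is assumed to be
    precomputed as in the paper; `temperature` is a placeholder value."""
    p_s = torch.softmax(student_logits / temperature, dim=-1)
    p_t = torch.softmax(teacher_logits / temperature, dim=-1)
    losses = []
    for s, t in zip(p_s, p_t):  # one transport problem per sample
        # ot.sinkhorn2 returns the regularized transport cost
        losses.append(ot.sinkhorn2(t, s, cost_matrix,
                                   reg=eta, numItermax=num_iters))
    return torch.stack(losses).mean()
```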
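The WKD-F projector description also translates naturally into a PyTorch module. This sketch follows the quoted structure (1 × 1 Conv and 3 × 3 Conv with 256 filters, then a 1 × 1 Conv matching the teacher's feature size); the exact placement of BN and ReLU within the block is an assumption:

```python
import torch.nn as nn

def make_wkd_f_projector(student_channels, teacher_channels, mid_channels=256):
    """Bottleneck projector mapping student feature maps to the teacher's
    channel dimension, per the structure quoted in the table above."""
    return nn.Sequential(
        nn.Conv2d(student_channels, mid_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(mid_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(mid_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_channels, teacher_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(teacher_channels),
        nn.ReLU(inplace=True),
    )
```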