KD-Zero: Evolving Knowledge Distiller for Any Teacher-Student Pairs
Authors: Lujun Li, Peijie Dong, Anggeng Li, Zimian Wei, Ya Yang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments reveal that KD-Zero consistently outperforms other state-of-the-art methods across diverse architectures on classification, detection, and segmentation tasks. |
| Researcher Affiliation | Collaboration | HKUST, HKUST(GZ), Huawei, NUST, CityU |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks with explicit labels like 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Codes are available in supplementary materials. |
| Open Datasets | Yes | We utilize the CIFAR-100 dataset [30] in knowledge distillation. ... We additionally conduct experiments on the ImageNet [12]. ... We conduct experiments on the MS-COCO dataset [41]. ... We evaluate KD-Zero on the Cityscapes dataset [10]. |
| Dataset Splits | Yes | During the distiller search phase, we apply 5% early-stopping training epochs with full training data for acceleration settings. ... Besides the accuracy metric of the validation set. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers like 'AdamW optimizer' and 'SGD optimizer' but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | The multi-step learning rate commences at 0.1, which decays by 0.1 at 100 and 150 epochs. ... The training is conducted on 224×224 resolution images for 300 epochs, with an initial learning rate of 5e-4 and a weight decay of 0.05 using the AdamW optimizer. ... all models are trained with a 2× learning schedule (24 epochs). We train all the models with SGD optimizer, where the momentum is 0.9, and the weight decay is 0.0001. ... During distillation, the batch size is 8, and the models are trained for 40K iterations with the SGD optimizer, where the momentum is 0.9 and the weight decay is 0.0005. (A hedged training-loop sketch based on these quoted settings follows the table.) |
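
The hyperparameters quoted in the experiment-setup row map onto a standard PyTorch training loop. The sketch below assumes the CIFAR-100 classification setting (multi-step learning rate starting at 0.1, decayed by 0.1 at epochs 100 and 150, as quoted above); the momentum, weight decay, 200-epoch budget, and plain KD objective are illustrative assumptions, not the paper's searched KD-Zero distiller.

```python
# Hedged sketch of the quoted CIFAR-100 distillation schedule.
# The multi-step LR (0.1, decayed by 0.1 at epochs 100 and 150) is quoted from the paper;
# momentum, weight decay, epoch budget, and the vanilla KD loss are placeholder assumptions.
import torch
import torch.nn.functional as F

def train_student(student, teacher, train_loader, epochs=200, device="cuda"):
    optimizer = torch.optim.SGD(
        student.parameters(),
        lr=0.1,              # initial LR quoted in the paper
        momentum=0.9,        # assumption: common CIFAR setting, not stated for this split
        weight_decay=5e-4,   # assumption: not specified for the CIFAR-100 runs
    )
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[100, 150], gamma=0.1  # decay by 0.1 at epochs 100 and 150
    )
    teacher.eval()
    for epoch in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                t_logits = teacher(images)
            s_logits = student(images)
            # Placeholder distillation objective (vanilla KD, T=4);
            # the searched KD-Zero loss would replace this term.
            T = 4.0
            kd = F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)
            loss = F.cross_entropy(s_logits, labels) + kd
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return student
```

For the 300-epoch ImageNet setting quoted in the same row, the optimizer line would instead be `torch.optim.AdamW(student.parameters(), lr=5e-4, weight_decay=0.05)`; the rest of the loop structure stays the same.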