KD-Zero: Evolving Knowledge Distiller for Any Teacher-Student Pairs

Authors: Lujun Li, Peijie Dong, Anggeng Li, Zimian Wei, Ya Yang

Venue: NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments reveal that KD-Zero consistently outperforms other state-of-the-art methods across diverse architectures on classification, detection, and segmentation tasks.
Researcher Affiliation | Collaboration | HKUST, HKUST (GZ), Huawei, NUST, City U
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks with explicit labels such as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Codes are available in supplementary materials.
Open Datasets | Yes | We utilize the CIFAR-100 dataset [30] in knowledge distillation. ... We additionally conduct experiments on ImageNet [12]. ... We conduct experiments on the MS-COCO dataset [41]. ... We evaluate KD-Zero on Cityscapes dataset [10].
Dataset Splits | Yes | During the distiller search phase, we apply 5% early-stopping training epochs with full training data for acceleration settings. ... Besides the accuracy metric of the validation set.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments.
Software Dependencies | No | The paper mentions optimizers such as the 'AdamW optimizer' and 'SGD optimizer' but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | The multi-step learning rate commences at 0.1, which decays by 0.1 at 100 and 150 epochs. ... The training is conducted on 224×224 resolution images for 300 epochs, with an initial learning rate of 5e-4 and a weight decay of 0.05 using the AdamW optimizer. ... all models are trained with a 2× learning schedule (24 epochs). We train all the models with the SGD optimizer, where the momentum is 0.9 and the weight decay is 0.0001. ... During distillation, the batch size is 8, and the models are trained for 40K iterations with the SGD optimizer, where the momentum is 0.9 and the weight decay is 0.0005.
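For readers reproducing the classification schedule quoted in the Experiment Setup row, a minimal sketch is given below, assuming PyTorch. Only the quoted values (initial learning rate 0.1, decay by 0.1 at epochs 100 and 150) come from the paper excerpt; the student model, momentum, weight decay, and total epoch count are illustrative assumptions.

    # Minimal sketch of the multi-step learning-rate schedule described above (assumption: PyTorch).
    # Student model, momentum, weight decay, and total epochs are placeholders, not values
    # confirmed by the paper excerpt.
    import torch
    import torchvision

    student = torchvision.models.resnet18(num_classes=100)       # hypothetical CIFAR-100 student
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1,    # lr starts at 0.1 (from the paper)
                                momentum=0.9, weight_decay=5e-4) # assumed values
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[100, 150], gamma=0.1)             # decay by 0.1 at epochs 100 and 150

    for epoch in range(200):  # total epoch count assumed
        # ... one pass over CIFAR-100 with the distillation loss, calling optimizer.step() per batch ...
        scheduler.step()      # apply the step decay at epoch boundaries

The other settings quoted in the same row (AdamW at 5e-4 with weight decay 0.05 for the 300-epoch 224×224 run, the 2× detection schedule, and the 40K-iteration segmentation run) would be configured analogously by swapping the optimizer and scheduler.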