KD-Zero: Evolving Knowledge Distiller for Any Teacher-Student Pairs
Authors: Lujun Li, Peijie Dong, Anggeng Li, Zimian Wei, Ya Yang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments reveal that KD-Zero consistently outperforms other state-of-the-art methods across diverse architectures on classification, detection, and segmentation tasks. |
| Researcher Affiliation | Collaboration | HKUST, HKUST(GZ), Huawei, NUST, CityU |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks with explicit labels like 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Codes are available in supplementary materials. |
| Open Datasets | Yes | We utilize the CIFAR-100 dataset [30] in knowledge distillation. ... We additionally conduct experiments on the ImageNet [12]. ... We conduct experiments on the MS-COCO dataset [41]. ... We evaluate KD-Zero on the Cityscapes dataset [10]. |
| Dataset Splits | Yes | During the distiller search phase, we apply 5% early-stopping training epochs with full training data for acceleration settings. ... Besides the accuracy metric of the validation set. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers like 'AdamW optimizer' and 'SGD optimizer' but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | The multi-step learning rate commences at 0.1, which decays by 0.1 at 100 and 150 epochs. ... The training is conducted on 224×224 resolution images for 300 epochs, with an initial learning rate of 5e-4 and a weight decay of 0.05 using the AdamW optimizer. ... all models are trained with a 2× learning schedule (24 epochs). We train all the models with SGD optimizer, where the momentum is 0.9, and the weight decay is 0.0001. ... During distillation, the batch size is 8, and the models are trained for 40K iterations with the SGD optimizer, where the momentum is 0.9 and the weight decay is 0.0005. (A hedged training-loop sketch based on these quoted settings follows the table.) |
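
The hyperparameters quoted in the experiment-setup row map onto a standard PyTorch training loop. The sketch below assumes the CIFAR-100 classification setting (multi-step learning rate starting at 0.1, decayed by 0.1 at epochs 100 and 150, as quoted above); the momentum, weight decay, 200-epoch budget, and plain KD objective are illustrative assumptions, not the paper's searched KD-Zero distiller.

```python
# Hedged sketch of the quoted CIFAR-100 distillation schedule.
# The multi-step LR (0.1, decayed by 0.1 at epochs 100 and 150) is quoted from the paper;
# momentum, weight decay, epoch budget, and the vanilla KD loss are placeholder assumptions.
import torch
import torch.nn.functional as F

def train_student(student, teacher, train_loader, epochs=200, device="cuda"):
    optimizer = torch.optim.SGD(
        student.parameters(),
        lr=0.1,              # initial LR quoted in the paper
        momentum=0.9,        # assumption: common CIFAR setting, not stated for this split
        weight_decay=5e-4,   # assumption: not specified for the CIFAR-100 runs
    )
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[100, 150], gamma=0.1  # decay by 0.1 at epochs 100 and 150
    )
    teacher.eval()
    for epoch in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                t_logits = teacher(images)
            s_logits = student(images)
            # Placeholder distillation objective (vanilla KD, T=4);
            # the searched KD-Zero loss would replace this term.
            T = 4.0
            kd = F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)
            loss = F.cross_entropy(s_logits, labels) + kd
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return student
```

For the 300-epoch ImageNet setting quoted in the same row, the optimizer line would instead be `torch.optim.AdamW(student.parameters(), lr=5e-4, weight_decay=0.05)`; the rest of the loop structure stays the same.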