Progressive Network Grafting for Few-Shot Knowledge Distillation

Authors: Chengchao Shen, Xinchao Wang, Youtan Yin, Jie Song, Sihui Luo, Mingli Song

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012. On CIFAR10 and CIFAR100, our performances are even on par with those of knowledge distillation schemes that utilize the full datasets.
Researcher Affiliation | Collaboration | Chengchao Shen (1), Xinchao Wang (2), Youtan Yin (1), Jie Song (1), Sihui Luo (1), Mingli Song (1,3); 1: Zhejiang University; 2: Stevens Institute of Technology; 3: Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies
Pseudocode | Yes | Algorithm 1: Network Grafting for Few-Shot Knowledge Distillation
Open Source Code | Yes | The source code is available at https://github.com/zju-vipa/NetGraft.
Open Datasets | Yes | Datasets and Models. Both CIFAR10 and CIFAR100 consist of 60,000 colour images of size 32x32, of which 50,000 form the training set and the remaining 10,000 form the test set. The CIFAR10 dataset contains 10 classes, and CIFAR100 contains 100 classes. In the few-shot setting, we randomly sample K samples per class from the original CIFAR datasets as the training set... The ILSVRC-2012 dataset (Russakovsky et al. 2015) contains 1.2 million training images and 50,000 validation images from 1,000 categories.
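The K-shot sampling quoted above (K images drawn at random from each class to form the few-shot training set) can be reproduced with a short PyTorch/torchvision sketch. This is an illustrative assumption, not the authors' released code; the helper name sample_k_shot and the reliance on torchvision's CIFAR10 targets attribute are hypothetical choices.

    import random
    from collections import defaultdict

    import torchvision
    from torch.utils.data import Subset

    def sample_k_shot(dataset, k, seed=0):
        # Collect the indices belonging to each class, then draw k per class.
        rng = random.Random(seed)
        indices_by_class = defaultdict(list)
        for idx, label in enumerate(dataset.targets):  # torchvision CIFAR label list
            indices_by_class[label].append(idx)
        chosen = []
        for indices in indices_by_class.values():
            chosen.extend(rng.sample(indices, k))
        return Subset(dataset, chosen)

    # 10-shot CIFAR10: 10 classes x 10 images = 100 training samples.
    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True,
        transform=torchvision.transforms.ToTensor())
    few_shot_set = sample_k_shot(train_set, k=10)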
Dataset Splits | Yes | The ILSVRC-2012 dataset (Russakovsky et al. 2015) contains 1.2 million images as the training set and 50,000 images as the validation set, from 1,000 categories.
Hardware Specification | Yes | The proposed method is implemented using PyTorch on a Quadro P6000 24 GB GPU.
Software Dependencies | No | The paper mentions PyTorch and the Adam optimizer but does not specify their versions or any other software dependencies with specific version numbers.
Experiment Setup | Yes | The batch size is set to 64 for 10-shot training; for K-shot training, the batch size is set to 64·K/10. For all experiments, we adopt the Adam algorithm for network optimization. Unless otherwise stated, the following learning rates apply to a batch size of 64; for another batch size B, the learning rate is scaled by the factor B/64, as in (He et al. 2016). The learning rates for block grafting and network grafting on CIFAR10 are set to 2.5×10^-4 and 1×10^-4, respectively. For CIFAR100, the learning rates are 1×10^-3 and 5×10^-5, respectively. Following (Kingma and Ba 2014), we set the weight decay to zero and the running averages of the gradient and its square to 0.9 and 0.999, respectively. We adopt the weight initialization proposed by (He et al. 2015). For ResNet18 on ILSVRC-2012, we adopt the same optimizer and weight initialization method as for VGG16-half. During block grafting, we set the learning rate of block1 and block2 to 10^-4, and that of block3 and block4 to 10^-3. During network grafting, the learning rates for block1-2, block1-3 and block1-4 are set to 10^-4, 2×10^-3 and 10^-3, respectively.
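For clarity, here is a minimal sketch of the optimizer configuration described above, assuming PyTorch: Adam with zero weight decay, running-average coefficients (betas) of 0.9 and 0.999, and a learning rate scaled linearly by B/64 for batch size B. The helper name make_optimizer and the stand-in model are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    def make_optimizer(model: nn.Module, base_lr: float, batch_size: int):
        # Linear learning-rate scaling by the factor B/64, as in the setup above.
        lr = base_lr * batch_size / 64
        return torch.optim.Adam(model.parameters(), lr=lr,
                                betas=(0.9, 0.999), weight_decay=0.0)

    # Example: block grafting on CIFAR10 with 10-shot training (batch size 64)
    # and the reported base learning rate of 2.5e-4.
    model = nn.Linear(512, 10)  # stand-in module for illustration only
    optimizer = make_optimizer(model, base_lr=2.5e-4, batch_size=64)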