Kernel Based Progressive Distillation for Adder Neural Networks

Authors: Yixing Xu, Chang Xu, Xinghao Chen, Wei Zhang, Chunjing Xu, Yunhe Wang

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of the proposed method for learning ANN with higher performance is then well-verified on several benchmarks. In this section, we conduct experiments on several computer vision benchmark datasets, including CIFAR-10, CIFAR-100 and ImageNet.
Researcher Affiliation | Collaboration | Yixing Xu 1, Chang Xu 2, Xinghao Chen 1, Wei Zhang 1, Chunjing Xu 1, Yunhe Wang 1 / 1 Noah's Ark Lab, Huawei Technologies; 2 The University of Sydney
Pseudocode | Yes | Algorithm 1 PKKD: Progressive Kernel Based Knowledge Distillation. / Input: A CNN network N_c, an ANN N_a, the number of intermediate layers M, the input feature map x_a^m, x_c^m and weight f_a^m, f_c^m in the m-th layer, and the training set {X, Y}. / 1: repeat / 2: Randomly select a batch of data {x_i, y_i}_{i=1}^n from {X, Y}, where n is the batch size; / 3: for m = 1, ..., M do / 4: Calculate the ANN output in the m-th layer from x_a^m and f_a^m; / 5: Transform the output feature of the ANN using Eq. 11 to obtain y_a^m; / 6: Calculate the CNN output in the m-th layer from x_c^m and f_c^m; / 7: Transform the output feature of the CNN using Eq. 12 to obtain y_c^m; / 8: end for / 9: Calculate the loss function L_mid in Eq. 13; / 10: Obtain the softmax outputs of the CNN and the ANN, denoted y_c and y_a, respectively; / 11: Compute the loss function L_blend in Eq. 4; / 12: Apply the KD loss L = β L_mid + L_blend to N_a; / 13: Calculate the normal cross-entropy loss L_ce = Σ_{i=1}^n H_cross(y_c^i, y_i) for N_c; / 14: Update the parameters of N_c and N_a using L_ce and L, respectively; / 15: until convergence / Output: The resulting ANN N_a with excellent performance. (A minimal code sketch of this procedure follows the table.)
Open Source Code | No | The paper does not provide any specific links to source code, nor does it explicitly state that the code for the described methodology is released or available in supplementary materials.
Open Datasets | Yes | In this section, we conduct experiments on several computer vision benchmark datasets, including CIFAR-10, CIFAR-100 and ImageNet.
Dataset Splits | No | The paper reports the total number of training and test images for CIFAR-10/100 (50k training, 10k test) and ImageNet (1.2M training, 50k test), but does not describe a validation split or how one was constructed.
Hardware Specification | Yes | The batchsize is set to 256, and the experiments are conducted on 8 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | An initial learning rate of 0.1 is set for both the CNN and the ANN, and a cosine learning rate scheduler is used in training. Both models are trained for 400 epochs with a batchsize of 256. During the experiments we set the hyper-parameters α = β ∈ {0.1, 0.5, 1, 5, 10}, and the best result among them is picked. The teacher and student models are trained for 150 epochs with an initial learning rate of 0.1 and a cosine learning rate decay scheduler. The weight decay and momentum are set to 0.0001 and 0.9, respectively. The batchsize is set to 256... (A sketch of this schedule follows the table, after the algorithm sketch.)
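
The following is a minimal PyTorch sketch of the joint update described by Algorithm 1, not the authors' implementation: the adder layers of the ANN, the kernel-based feature transforms of Eq. 11/12, and the exact losses of Eq. 4/13 are not reproduced here, so TinyNet, the 1x1-conv transforms, and blend_kd_loss are hypothetical stand-ins. The sketch only illustrates the control flow, i.e. the CNN teacher is updated with plain cross-entropy while the ANN student is updated with L = β·L_mid + L_blend.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Toy two-block backbone used as a stand-in for both N_c and N_a.
    In the paper, the student N_a is built from adder layers instead of
    convolutions; that detail is omitted here."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                    nn.BatchNorm2d(16), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1),
                                    nn.BatchNorm2d(32), nn.ReLU())
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        logits = self.head(F.adaptive_avg_pool2d(f2, 1).flatten(1))
        return logits, [f1, f2]  # intermediate feature maps feed L_mid

def blend_kd_loss(student_logits, teacher_logits, targets, T=4.0):
    """Generic KD loss (KL to the teacher + CE to the labels), standing in
    for the blended loss of Eq. 4."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return kd + F.cross_entropy(student_logits, targets)

teacher_cnn, student_ann = TinyNet(), TinyNet()
# Hypothetical learned 1x1-conv feature transforms standing in for Eq. 11 / Eq. 12.
transforms_a = nn.ModuleList([nn.Conv2d(16, 16, 1), nn.Conv2d(32, 32, 1)])
transforms_c = nn.ModuleList([nn.Conv2d(16, 16, 1), nn.Conv2d(32, 32, 1)])

opt_c = torch.optim.SGD(teacher_cnn.parameters(), lr=0.1, momentum=0.9)
opt_a = torch.optim.SGD(list(student_ann.parameters())
                        + list(transforms_a.parameters())
                        + list(transforms_c.parameters()),
                        lr=0.1, momentum=0.9)
beta = 1.0

def train_step(images, targets):
    # Steps 3-8: intermediate features of both networks, then transform them.
    logits_c, feats_c = teacher_cnn(images)
    logits_a, feats_a = student_ann(images)
    l_mid = sum(F.mse_loss(ta(fa), tc(fc.detach()))
                for ta, fa, tc, fc in zip(transforms_a, feats_a,
                                          transforms_c, feats_c))
    # Steps 10-12: blended KD loss for the ANN student.
    l_student = beta * l_mid + blend_kd_loss(logits_a, logits_c.detach(), targets)
    # Step 13: plain cross-entropy for the CNN teacher, trained progressively.
    l_teacher = F.cross_entropy(logits_c, targets)
    # Step 14: update each network from its own loss only.
    opt_a.zero_grad(); l_student.backward()
    opt_c.zero_grad(); l_teacher.backward()
    opt_a.step(); opt_c.step()
    return l_teacher.item(), l_student.item()

# Example usage with random data:
imgs = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
print(train_step(imgs, labels))

A single call to train_step performs one iteration of steps 2-14; wrapping it in an epoch loop with the schedule quoted in the Experiment Setup row above recovers the overall training procedure.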
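
As a rough sketch only, the schedule quoted in the Experiment Setup row maps onto a standard PyTorch SGD plus cosine-annealing setup. The learning rate, momentum, weight decay, batch size and epoch counts are taken from the quote; the placeholder model and everything else below are assumptions, not the authors' code.

import torch

model = torch.nn.Linear(10, 10)   # placeholder module; the paper trains a CNN teacher and an ANN student
epochs = 400                      # CIFAR setting; the ImageNet runs use 150 epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... iterate over mini-batches of size 256 and call optimizer.step() here ...
    scheduler.step()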