Kernel Based Progressive Distillation for Adder Neural Networks
Authors: Yixing Xu, Chang Xu, Xinghao Chen, Wei Zhang, Chunjing Xu, Yunhe Wang
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of the proposed method for learning ANN with higher performance is then well-verified on several benchmarks. In this section, we conduct experiments on several computer vision benchmark datasets, including CIFAR-10, CIFAR-100 and ImageNet. |
| Researcher Affiliation | Collaboration | Yixing Xu1, Chang Xu2, Xinghao Chen1, Wei Zhang1, Chunjing Xu1, Yunhe Wang1 / 1Noah's Ark Lab, Huawei Technologies; 2The University of Sydney |
| Pseudocode | Yes | Algorithm 1 PKKD: Progressive Kernel Based Knowledge Distillation. / Input: A CNN network N_c, an ANN N_a, number of intermediate layers M, input feature maps x_a^m, x_c^m and weights f_a^m, f_c^m in the m-th layer. Training set {X, Y}. / 1: repeat / 2: Randomly select a batch of data {x_i, y_i}_{i=1}^n from {X, Y}, where n is the batch size; / 3: for m = 1, ..., M do: / 4: Calculate the ANN output in the m-th layer from x_a^m and f_a^m; / 5: Transform the output feature of ANN using Eq. 11 to obtain y_a^m; / 6: Calculate the CNN output in the m-th layer from x_c^m and f_c^m; / 7: Transform the output feature of CNN using Eq. 12 to obtain y_c^m; / 8: end for / 9: Calculate the loss function L_mid in Eq. 13; / 10: Obtain the softmax outputs of CNN and ANN and denote them as y_c and y_a, respectively; / 11: Compute the loss function L_blend in Eq. 4; / 12: Apply the KD loss L = β·L_mid + L_blend for N_a; / 13: Calculate the normal cross-entropy loss L_ce = Σ_{i=1}^n H_cross(y_c^i, y_i) for N_c; / 14: Update parameters in N_c and N_a using L_ce and L, respectively; / 15: until converge / Output: The resulting ANN N_a with excellent performance. (A hedged code sketch of this training step is given after the table.) |
| Open Source Code | No | The paper does not provide any specific links to source code, nor does it explicitly state that the code for the described methodology is released or available in supplementary materials. |
| Open Datasets | Yes | In this section, we conduct experiments on several computer vision benchmark datasets, including CIFAR-10, CIFAR-100 and ImageNet. |
| Dataset Splits | No | The paper states the total number of training and test images for CIFAR-10/100 (50k training, 10k test) and ImageNet (1.2M training, 50k test), but it does not describe a validation split or how one would be constructed. |
| Hardware Specification | Yes | The batchsize is set to 256, and the experiments are conducted on 8 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | An initial learning rate of 0.1 is set for both CNN and ANN, and a cosine learning rate scheduler is used in training. Both models are trained for 400 epochs with a batchsize of 256. During the experiment we set hyper-parameters α = β ∈ {0.1, 0.5, 1, 5, 10}, and the best result among them is picked. The teacher and student models are trained for 150 epochs with an initial learning rate of 0.1 and a cosine learning rate decay scheduler. The weight decay and momentum are set to 0.0001 and 0.9, respectively. The batchsize is set to 256... (The reported optimization recipe is sketched in code after the table.) |
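
The PKKD training step quoted in the Pseudocode row can be summarized as a short PyTorch-style sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the objects `ann`, `cnn`, `transform_a`, `transform_c`, the Gaussian-kernel feature comparison standing in for Eq. 13, and the standard KD-style blend standing in for Eq. 4 (with Eq. 11/12 approximated by learnable per-layer transforms) are all assumptions made for readability.

```python
# Hedged sketch of one PKKD training step (Algorithm 1). Module names and the
# exact forms of Eq. 4, 11, 12 and 13 are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def gaussian_kernel(x, y, sigma=1.0):
    # Kernel-based comparison of flattened features: squared distance mapped
    # through a Gaussian, so similar features give values close to 1.
    return torch.exp(-((x - y) ** 2).sum(dim=1) / (2 * sigma ** 2))

def pkkd_step(ann, cnn, transform_a, transform_c, x, y, alpha=0.5, beta=1.0, T=4.0):
    # Both networks are assumed to return (logits, list_of_intermediate_features).
    logits_a, feats_a = ann(x)   # ANN student
    logits_c, feats_c = cnn(x)   # CNN teacher (trained progressively as well)

    # L_mid: kernel-based distance between transformed intermediate features
    # (transform_a / transform_c play the role of Eq. 11 / Eq. 12 here).
    l_mid = 0.0
    for m, (fa, fc) in enumerate(zip(feats_a, feats_c)):
        ya = transform_a[m](fa)             # student-side transform
        yc = transform_c[m](fc).detach()    # teacher-side target (treated as fixed)
        l_mid = l_mid + (1.0 - gaussian_kernel(ya.flatten(1), yc.flatten(1))).mean()

    # L_blend: blend soft teacher predictions with the ground-truth labels
    # (an assumed standard KD form of Eq. 4 with temperature T).
    soft_targets = F.softmax(logits_c.detach() / T, dim=1)
    log_probs_a = F.log_softmax(logits_a / T, dim=1)
    kd_loss = F.kl_div(log_probs_a, soft_targets, reduction="batchmean") * T * T
    l_blend = alpha * kd_loss + (1.0 - alpha) * F.cross_entropy(logits_a, y)

    # Student loss L = beta * L_mid + L_blend; the teacher keeps training with
    # plain cross entropy, so both networks are updated in every step.
    loss_ann = beta * l_mid + l_blend
    loss_cnn = F.cross_entropy(logits_c, y)
    return loss_ann, loss_cnn
```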
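
The Experiment Setup row translates into an optimization recipe along the following lines. The sketch assumes the 400-epoch setting (the quoted 150-epoch schedule applies to the other benchmark) and reuses `pkkd_step`, `ann`, `cnn`, `transform_a`, `transform_c` and a `train_loader` with batch size 256 from the sketch above; using one optimizer per network is an assumption, since the quote only states the shared hyper-parameters.

```python
import torch

# SGD with initial LR 0.1, momentum 0.9, weight decay 1e-4 and cosine decay,
# as reported; the student optimizer also covers the assumed feature transforms.
opt_ann = torch.optim.SGD(list(ann.parameters()) + list(transform_a.parameters()),
                          lr=0.1, momentum=0.9, weight_decay=1e-4)
opt_cnn = torch.optim.SGD(cnn.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
sched_ann = torch.optim.lr_scheduler.CosineAnnealingLR(opt_ann, T_max=400)
sched_cnn = torch.optim.lr_scheduler.CosineAnnealingLR(opt_cnn, T_max=400)

for epoch in range(400):                  # 400 epochs in this setting
    for x, y in train_loader:             # DataLoader with batch size 256
        loss_ann, loss_cnn = pkkd_step(ann, cnn, transform_a, transform_c, x, y)
        opt_ann.zero_grad()
        loss_ann.backward()               # updates the ANN student (and transforms)
        opt_ann.step()
        opt_cnn.zero_grad()
        loss_cnn.backward()               # updates the CNN teacher with plain CE
        opt_cnn.step()
    sched_ann.step()
    sched_cnn.step()
```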