NORM: Knowledge Distillation via N-to-One Representation Matching
Authors: Xiaolong Liu, Lujun Li, Chao Li, Anbang Yao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on different visual recognition benchmarks demonstrate the leading performance of our method. For instance, the ResNet18\|MobileNet\|ResNet50-1/4 model trained by NORM reaches 72.14%\|74.26%\|68.03% top-1 accuracy on the ImageNet dataset when using a pretrained ResNet34\|ResNet50\|ResNet50 model as the teacher, achieving an absolute improvement of 2.01%\|4.63%\|3.03% against the individually trained counterpart. Code is available at https://github.com/OSVAI/NORM. |
| Researcher Affiliation | Industry | Xiaolong Liu*, Lujun Li*, Chao Li*, Anbang Yao* Intel Labs China {xiaolong.liu,lujun.li,chao3.li,anbang.yao}@intel.com |
| Pseudocode | No | The paper describes the method using mathematical formulations such as: $F_{se} = W_{se} \ast F_s$, $F_{sc} = W_{sc} \ast F_{se}$ (Eq. 2), where $\ast$ denotes the convolution operation. Next, we sequentially split the expanded student representation $F_{se}$ into $N$ non-overlapping segments $F^i_{se} \in \mathbb{R}^{H \times W \times C_t}$, $1 \le i \le N$, having the same number of feature channels as the teacher's. (A code sketch of this N-to-one matching is given after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/OSVAI/NORM. |
| Open Datasets | Yes | We use CIFAR-100 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015) datasets for basic experiments. |
| Dataset Splits | Yes | CIFAR-100, which consists of 50,000 training images and 10,000 test images with 100 classes, is a popular classification dataset for KD research. |
| Hardware Specification | Yes | All models are trained on an Intel Xeon Silver 4214R CPU server using one NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | All experiments are implemented with PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | Specifically, for each teacher-student pair, the model is trained by the stochastic gradient descent (SGD) optimizer for 240 epochs, with a batch size of 64, a weight decay of 0.0005 and a momentum of 0.9. The initial learning rate is set to 0.1 and decreased by a factor of 10 at epochs 150, 180 and 210. (This schedule is sketched in code after the table.) |
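
The N-to-one matching quoted in the Pseudocode row can be summarized in a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation (the linked repository contains that): the expansion and contraction layers `w_se`/`w_sc` are assumed here to be 1x1 convolutions, MSE is used as a placeholder per-segment distance, and the student and teacher feature maps are assumed to share the same spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NtoOneMatching(nn.Module):
    """Minimal sketch of N-to-one representation matching (Eq. 2).

    Assumptions (not taken from the paper's code): 1x1 convolutions for
    W_se / W_sc, MSE as the per-segment distance, and student/teacher
    feature maps with identical spatial size H x W.
    """

    def __init__(self, c_student: int, c_teacher: int, n: int):
        super().__init__()
        self.n, self.c_t = n, c_teacher
        # W_se: expand the student representation to N * C_t channels
        self.w_se = nn.Conv2d(c_student, n * c_teacher, kernel_size=1, bias=False)
        # W_sc: contract the expanded representation back to C_s channels
        self.w_sc = nn.Conv2d(n * c_teacher, c_student, kernel_size=1, bias=False)

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor):
        f_se = self.w_se(f_s)   # expanded student representation F_se
        f_sc = self.w_sc(f_se)  # contracted representation F_sc
        # Split F_se into N non-overlapping segments of C_t channels each
        segments = torch.split(f_se, self.c_t, dim=1)
        # Match every segment against the single teacher representation
        loss = sum(F.mse_loss(seg, f_t) for seg in segments) / self.n
        return f_sc, loss


# Hypothetical shapes: 7x7 feature maps, N = 4 segments
norm = NtoOneMatching(c_student=256, c_teacher=512, n=4)
f_s = torch.randn(2, 256, 7, 7)            # student feature map
f_t = torch.randn(2, 512, 7, 7).detach()   # teacher feature map (no gradient)
f_sc, kd_loss = norm(f_s, f_t)
```

The contracted output `f_sc` stands in for the feature that would be passed on to the rest of the student network, while `kd_loss` is the distillation term added to the usual task loss; both roles follow the description quoted above rather than the released code.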
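Similarly, the CIFAR-100 schedule quoted in the Experiment Setup row maps directly onto standard PyTorch optimizer and scheduler calls. The sketch below only wires up the reported hyperparameters; `student` is a placeholder module, and data loading (batch size 64) and the loss computation are omitted.

```python
import torch

student = torch.nn.Linear(10, 10)  # placeholder for the actual student network

# SGD with the reported hyperparameters: lr 0.1, momentum 0.9, weight decay 0.0005
optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Learning rate decreased by a factor of 10 at epochs 150, 180 and 210
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 180, 210], gamma=0.1)

for epoch in range(240):
    # ... one training epoch over CIFAR-100 with batch size 64 (omitted) ...
    scheduler.step()
```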