NORM: Knowledge Distillation via N-to-One Representation Matching

Authors: Xiaolong Liu, Lujun Li, Chao Li, Anbang Yao

ICLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on different visual recognition benchmarks demonstrate the leading performance of our method. For instance, the ResNet18|MobileNet|ResNet50-1/4 model trained by NORM reaches 72.14%|74.26%|68.03% top-1 accuracy on the ImageNet dataset when using a pretrained ResNet34|ResNet50|ResNet50 model as the teacher, achieving an absolute improvement of 2.01%|4.63%|3.03% against the individually trained counterpart. Code is available at https://github.com/OSVAI/NORM.
Researcher Affiliation Industry Xiaolong Liu*, Lujun Li*, Chao Li*, Anbang Yao*; Intel Labs China; {xiaolong.liu,lujun.li,chao3.li,anbang.yao}@intel.com
Pseudocode No The paper describes the method using mathematical formulations rather than pseudocode, such as Eq. (2): F_se = W_se ⊛ F_s, F_sc = W_sc ⊛ F_se, where ⊛ denotes the convolution operation. Next, we sequentially split the expanded student representation F_se into N non-overlapping segments F_se^i ∈ R^(H×W×C_t), 1 ≤ i ≤ N, having the same number of feature channels as the teacher's. (A hedged PyTorch sketch of this N-to-one matching is given after the table.)
Open Source Code Yes Code is available at https://github.com/OSVAI/NORM.
Open Datasets Yes We use CIFAR-100 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015) datasets for basic experiments.
Dataset Splits Yes CIFAR-100, which consists of 50,000 training images and 10,000 test images with 100 classes, is a popular classification dataset for KD research.
Hardware Specification Yes All models are trained on an Intel Xeon Silver 4214R CPU server using one NVIDIA GeForce RTX 3090 GPU.
Software Dependencies No All experiments are implemented with PyTorch (Paszke et al., 2019).
Experiment Setup Yes Specifically, for each teacher-student pair, the model is trained by the stochastic gradient descent (SGD) optimizer for 240 epochs, with a batch size of 64, a weight decay of 0.0005 and a momentum of 0.9. The initial learning rate is set to 0.1 and decreased by a factor of 10 at epochs 150, 180, and 210. (This schedule is summarized in the configuration sketch after the table.)
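
The formulation quoted in the Pseudocode row can be condensed into a short PyTorch sketch. The module below is a minimal, hedged illustration: it assumes the matching loss is a plain MSE between each student segment and the detached teacher feature, that student and teacher features share the same spatial size, and that two 1×1 convolutions play the role of the expansion W_se and contraction W_sc. All class, variable, and function names are hypothetical; this is not the authors' released implementation (see https://github.com/OSVAI/NORM for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class NToOneMatching(nn.Module):
    """Minimal sketch of N-to-one representation matching (assumed reading of Eq. 2)."""

    def __init__(self, c_student: int, c_teacher: int, n: int):
        super().__init__()
        self.n = n
        # Expansion: F_se = W_se conv F_s, producing N * C_t channels.
        self.expand = nn.Conv2d(c_student, n * c_teacher, kernel_size=1, bias=False)
        # Contraction: F_sc = W_sc conv F_se, back to the student's channel count.
        self.contract = nn.Conv2d(n * c_teacher, c_student, kernel_size=1, bias=False)

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor):
        f_se = self.expand(f_s)                      # expanded student representation
        f_sc = self.contract(f_se)                   # contracted feature for the task branch
        segments = torch.chunk(f_se, self.n, dim=1)  # N non-overlapping C_t-channel segments
        # Match every segment to the (detached) teacher representation; plain MSE is an assumption.
        loss = sum(F.mse_loss(seg, f_t.detach()) for seg in segments) / self.n
        return f_sc, loss

# Toy usage: 64-channel student feature, 256-channel teacher feature, N = 4.
matcher = NToOneMatching(c_student=64, c_teacher=256, n=4)
f_s = torch.randn(2, 64, 8, 8)
f_t = torch.randn(2, 256, 8, 8)
f_sc, kd_loss = matcher(f_s, f_t)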
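
Similarly, the CIFAR-100 schedule quoted in the Experiment Setup row maps onto a standard PyTorch optimizer/scheduler configuration. The sketch below only encodes the quoted hyperparameters (SGD, 240 epochs, batch size 64, weight decay 0.0005, momentum 0.9, initial learning rate 0.1 decayed by 10× at epochs 150, 180, and 210); the function name and surrounding wiring are assumptions, not the authors' training script.

import torch

def build_optimizer_and_scheduler(student: torch.nn.Module):
    # Hyperparameters taken from the quoted setup; 240 epochs and batch size 64
    # are handled by the (omitted) training loop and data loader.
    optimizer = torch.optim.SGD(
        student.parameters(),
        lr=0.1,            # initial learning rate
        momentum=0.9,
        weight_decay=5e-4,
    )
    # Learning rate decreased by a factor of 10 at epochs 150, 180, and 210.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[150, 180, 210], gamma=0.1
    )
    return optimizer, scheduler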