NORM: Knowledge Distillation via N-to-One Representation Matching
Authors: Xiaolong Liu, Lujun Li, Chao Li, Anbang Yao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on different visual recognition benchmarks demonstrate the leading performance of our method. For instance, the ResNet18\|MobileNet\|ResNet50-1/4 model trained by NORM reaches 72.14%\|74.26%\|68.03% top-1 accuracy on the ImageNet dataset when using a pretrained ResNet34\|ResNet50\|ResNet50 model as the teacher, achieving an absolute improvement of 2.01%\|4.63%\|3.03% against the individually trained counterpart. Code is available at https://github.com/OSVAI/NORM. |
| Researcher Affiliation | Industry | Xiaolong Liu*, Lujun Li*, Chao Li*, Anbang Yao* Intel Labs China {xiaolong.liu,lujun.li,chao3.li,anbang.yao}@intel.com |
| Pseudocode | No | The paper describes the method using mathematical formulations such as: $F_{se} = W_{se} \ast F_s$, $F_{sc} = W_{sc} \ast F_{se}$ (Eq. 2), where $\ast$ denotes the convolution operation. Next, we sequentially split the expanded student representation $F_{se}$ into $N$ non-overlapping segments $F^i_{se} \in \mathbb{R}^{H \times W \times C_t}$, $1 \le i \le N$, having the same number of feature channels as the teacher's. (A code sketch of this N-to-one matching is given after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/OSVAI/NORM. |
| Open Datasets | Yes | We use CIFAR-100 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015) datasets for basic experiments. |
| Dataset Splits | Yes | CIFAR-100, which consists of 50,000 training images and 10,000 test images with 100 classes, is a popular classification dataset for KD research. |
| Hardware Specification | Yes | All models are trained on an Intel Xeon Silver 4214R CPU server using one NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | All experiments are implemented with PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | Specifically, for each teacher-student pair, the model is trained by the stochastic gradient descent (SGD) optimizer for 240 epochs, with a batch size of 64, a weight decay of 0.0005 and a momentum of 0.9. The initial learning rate is set to 0.1 and decreased by a factor of 10 at epochs 150, 180 and 210. (This schedule is sketched in code after the table.) |
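
The N-to-one matching quoted in the Pseudocode row can be summarized in a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation (the linked repository contains that): the expansion and contraction layers `w_se`/`w_sc` are assumed here to be 1x1 convolutions, MSE is used as a placeholder per-segment distance, and the student and teacher feature maps are assumed to share the same spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NtoOneMatching(nn.Module):
    """Minimal sketch of N-to-one representation matching (Eq. 2).

    Assumptions (not taken from the paper's code): 1x1 convolutions for
    W_se / W_sc, MSE as the per-segment distance, and student/teacher
    feature maps with identical spatial size H x W.
    """

    def __init__(self, c_student: int, c_teacher: int, n: int):
        super().__init__()
        self.n, self.c_t = n, c_teacher
        # W_se: expand the student representation to N * C_t channels
        self.w_se = nn.Conv2d(c_student, n * c_teacher, kernel_size=1, bias=False)
        # W_sc: contract the expanded representation back to C_s channels
        self.w_sc = nn.Conv2d(n * c_teacher, c_student, kernel_size=1, bias=False)

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor):
        f_se = self.w_se(f_s)   # expanded student representation F_se
        f_sc = self.w_sc(f_se)  # contracted representation F_sc
        # Split F_se into N non-overlapping segments of C_t channels each
        segments = torch.split(f_se, self.c_t, dim=1)
        # Match every segment against the single teacher representation
        loss = sum(F.mse_loss(seg, f_t) for seg in segments) / self.n
        return f_sc, loss


# Hypothetical shapes: 7x7 feature maps, N = 4 segments
norm = NtoOneMatching(c_student=256, c_teacher=512, n=4)
f_s = torch.randn(2, 256, 7, 7)            # student feature map
f_t = torch.randn(2, 512, 7, 7).detach()   # teacher feature map (no gradient)
f_sc, kd_loss = norm(f_s, f_t)
```

The contracted output `f_sc` stands in for the feature that would be passed on to the rest of the student network, while `kd_loss` is the distillation term added to the usual task loss; both roles follow the description quoted above rather than the released code.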
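Similarly, the CIFAR-100 schedule quoted in the Experiment Setup row maps directly onto standard PyTorch optimizer and scheduler calls. The sketch below only wires up the reported hyperparameters; `student` is a placeholder module, and data loading (batch size 64) and the loss computation are omitted.

```python
import torch

student = torch.nn.Linear(10, 10)  # placeholder for the actual student network

# SGD with the reported hyperparameters: lr 0.1, momentum 0.9, weight decay 0.0005
optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Learning rate decreased by a factor of 10 at epochs 150, 180 and 210
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 180, 210], gamma=0.1)

for epoch in range(240):
    # ... one training epoch over CIFAR-100 with batch size 64 (omitted) ...
    scheduler.step()
```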