Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again

Authors: Xin-Chun Li, Wen-Shu Fan, Shaoming Song, Yinchuan Li, Bingshuai Li, Yunfeng Shao, De-Chuan Zhan

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS.
Researcher Affiliation | Collaboration | (1) State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; (2) Huawei Noah's Ark Lab, Beijing, China
Pseudocode | No | No structured pseudocode or algorithm block was found.
Open Source Code | Yes | The demo developed in MindSpore is available at https://gitee.com/lxcnju/ats-mindspore and will be available at https://gitee.com/mindspore/models/tree/master/research/cv/ats.
Open Datasets | Yes | We use CIFAR-10/CIFAR-100 [21], Tiny ImageNet [43], CUB [47], Stanford Dogs [19], and Google Speech Commands [48] as the datasets.
Dataset Splits | Yes | Other dataset, network, and training details are in Appendix C. For these datasets [CIFAR-10/100, Tiny ImageNet, CUB, Stanford Dogs], we use the standard training/validation/test splits. For the Google Speech Commands dataset, we use the standard training/testing split with 80% and 20% respectively, and validate on 5% of the training data.
Hardware Specification | Yes | We run all experiments on GPU servers (NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 2080 Ti).
Software Dependencies | Yes | Our experiments are implemented with PyTorch 1.7.0 and MindSpore 1.3.0.
Experiment Setup | Yes | Except that Google Speech Commands takes 50 epochs, we train networks on the other datasets for 240 epochs. We use the SGD optimizer with 0.9 momentum. For VGG, AlexNet, ResNet, Wide ResNet, and ResNeXt, we set the learning rate as 0.05 (recommended by [44]). For ShuffleNet and MobileNet, we use a smaller learning rate of 0.01 (recommended by [44]). We use the pre-trained models provided in PyTorch for CUB and Stanford Dogs, and correspondingly, their learning rates are scaled by 0.1. During training, we decay the learning rate by 0.1 every 30 epochs after the first 150 epochs (recommended by [44]). For Google Speech Commands, we decay the learning rate via cosine annealing. We set the batch size as 128 for CIFAR data and 64 for the other datasets.
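
The image datasets in the Open Datasets row are all publicly downloadable; CIFAR-10/100, for instance, ship with torchvision. A minimal sketch of fetching the standard CIFAR-100 splits, assuming torchvision is available (the data root, transform, and normalization statistics are illustrative choices, not values taken from the paper):

```python
import torchvision
import torchvision.transforms as T

# Commonly used CIFAR-100 channel statistics (illustrative, not from the paper).
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

# Download the standard training and test splits.
train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100(
    root="./data", train=False, download=True, transform=transform)
```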
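
For the Dataset Splits row, the Google Speech Commands protocol is an 80%/20% training/testing split with 5% of the training data held out for validation. The paper uses the dataset's standard split; the sketch below only illustrates the stated proportions with torch.utils.data.random_split (the full_dataset argument and seed are hypothetical placeholders):

```python
import torch
from torch.utils.data import random_split

def split_80_20_with_val(full_dataset, seed=0):
    """Illustrative 80%/20% train/test split, then hold out 5% of the
    training portion for validation, mirroring the proportions described
    for Google Speech Commands."""
    n_total = len(full_dataset)
    n_train = int(0.8 * n_total)
    n_test = n_total - n_train
    train_set, test_set = random_split(
        full_dataset, [n_train, n_test],
        generator=torch.Generator().manual_seed(seed))

    n_val = int(0.05 * n_train)
    train_set, val_set = random_split(
        train_set, [n_train - n_val, n_val],
        generator=torch.Generator().manual_seed(seed))
    return train_set, val_set, test_set
```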
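
The Experiment Setup row translates naturally into a standard PyTorch optimizer and scheduler configuration. A minimal sketch for the 240-epoch image runs, assuming SGD with momentum 0.9, a learning rate of 0.05, batch size 128, and a step decay of 0.1 every 30 epochs after epoch 150 (so milestones at 150, 180, and 210; the milestone list is derived from the description, and the model/dataset objects are placeholders). ShuffleNet and MobileNet would use lr=0.01, and Google Speech Commands would swap in CosineAnnealingLR over 50 epochs:

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

def build_training(model, train_set, lr=0.05, batch_size=128):
    """SGD with momentum 0.9; decay the learning rate by 0.1 every 30 epochs
    after the first 150 epochs (milestones 150/180/210 for a 240-epoch run)."""
    loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=[150, 180, 210], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    return loader, optimizer, scheduler, criterion

# Skeleton of the per-epoch loop: the scheduler steps once per epoch.
# for epoch in range(240):
#     for x, y in loader:
#         optimizer.zero_grad()
#         loss = criterion(model(x), y)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```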