Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again

Authors: Xin-Chun Li, Wen-Shu Fan, Shaoming Song, Yinchuan Li, Bingshuai Li, Yunfeng Shao, De-Chuan Zhan

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS.
Researcher Affiliation | Collaboration | (1) State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; (2) Huawei Noah's Ark Lab, Beijing, China
Pseudocode | No | No structured pseudocode or algorithm block was found.
Open Source Code | Yes | The demo developed in MindSpore is available at https://gitee.com/lxcnju/ats-mindspore and will be available at https://gitee.com/mindspore/models/tree/master/research/cv/ats.
Open Datasets | Yes | We use CIFAR-10/CIFAR-100 [21], Tiny ImageNet [43], CUB [47], Stanford Dogs [19], and Google Speech Commands [48] as the datasets.
Dataset Splits | Yes | Other dataset, network, and training details are in Appendix C. For these datasets [CIFAR-10/100, Tiny ImageNet, CUB, Stanford Dogs], we use the standard training/validation/test splits. For the Google Speech Commands dataset, we use the standard training/testing split with 80% and 20% respectively, and validate on 5% of the training data.
Hardware Specification | Yes | We run all experiments on GPU servers (NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 2080 Ti).
Software Dependencies | Yes | Our experiments are implemented with PyTorch 1.7.0 and MindSpore 1.3.0.
Experiment Setup | Yes | Except that Google Speech Commands takes 50 epochs, we train networks on the other datasets for 240 epochs. We use the SGD optimizer with 0.9 momentum. For VGG, AlexNet, ResNet, Wide ResNet, and ResNeXt, we set the learning rate as 0.05 (recommended by [44]). For ShuffleNet and MobileNet, we use a smaller learning rate of 0.01 (recommended by [44]). We use the pre-trained models provided in PyTorch for CUB and Stanford Dogs, and correspondingly, their learning rates are scaled by 0.1. During training, we decay the learning rate by 0.1 every 30 epochs after the first 150 epochs (recommended by [44]). For Google Speech Commands, we decay the learning rate via cosine annealing. We set the batch size as 128 for CIFAR data and 64 for the other datasets.
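
The image datasets in the Open Datasets row are all publicly downloadable; CIFAR-10/100, for instance, ship with torchvision. A minimal sketch of fetching the standard CIFAR-100 splits, assuming torchvision is available (the data root, transform, and normalization statistics are illustrative choices, not values taken from the paper):

```python
import torchvision
import torchvision.transforms as T

# Commonly used CIFAR-100 channel statistics (illustrative, not from the paper).
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

# Download the standard training and test splits.
train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100(
    root="./data", train=False, download=True, transform=transform)
```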
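
For the Dataset Splits row, the Google Speech Commands protocol is an 80%/20% training/testing split with 5% of the training data held out for validation. The paper uses the dataset's standard split; the sketch below only illustrates the stated proportions with torch.utils.data.random_split (the full_dataset argument and seed are hypothetical placeholders):

```python
import torch
from torch.utils.data import random_split

def split_80_20_with_val(full_dataset, seed=0):
    """Illustrative 80%/20% train/test split, then hold out 5% of the
    training portion for validation, mirroring the proportions described
    for Google Speech Commands."""
    n_total = len(full_dataset)
    n_train = int(0.8 * n_total)
    n_test = n_total - n_train
    train_set, test_set = random_split(
        full_dataset, [n_train, n_test],
        generator=torch.Generator().manual_seed(seed))

    n_val = int(0.05 * n_train)
    train_set, val_set = random_split(
        train_set, [n_train - n_val, n_val],
        generator=torch.Generator().manual_seed(seed))
    return train_set, val_set, test_set
```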
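
The Experiment Setup row translates naturally into a standard PyTorch optimizer and scheduler configuration. A minimal sketch for the 240-epoch image runs, assuming SGD with momentum 0.9, a learning rate of 0.05, batch size 128, and a step decay of 0.1 every 30 epochs after epoch 150 (so milestones at 150, 180, and 210; the milestone list is derived from the description, and the model/dataset objects are placeholders). ShuffleNet and MobileNet would use lr=0.01, and Google Speech Commands would swap in CosineAnnealingLR over 50 epochs:

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

def build_training(model, train_set, lr=0.05, batch_size=128):
    """SGD with momentum 0.9; decay the learning rate by 0.1 every 30 epochs
    after the first 150 epochs (milestones 150/180/210 for a 240-epoch run)."""
    loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=[150, 180, 210], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    return loader, optimizer, scheduler, criterion

# Skeleton of the per-epoch loop: the scheduler steps once per epoch.
# for epoch in range(240):
#     for x, y in loader:
#         optimizer.zero_grad()
#         loss = criterion(model(x), y)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```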