Asymmetric Temperature Scaling Makes Larger Networks Teach Well Again
Authors: Xin-Chun Li, Wen-Shu Fan, Shaoming Song, Yinchuan Li, Bingshuai Li, Yunfeng Shao, De-Chuan Zhan
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both theoretical analysis and extensive experimental results demonstrate the effectiveness of ATS. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; 2 Huawei Noah's Ark Lab, Beijing, China |
| Pseudocode | No | No structured pseudocode or algorithm block was found. |
| Open Source Code | Yes | The demo developed in MindSpore is available at https://gitee.com/lxcnju/ats-mindspore and will be available at https://gitee.com/mindspore/models/tree/master/research/cv/ats. |
| Open Datasets | Yes | We use CIFAR-10/CIFAR-100 [21], Tiny ImageNet [43], CUB [47], Stanford Dogs [19], and Google Speech Commands [48] as the datasets. |
| Dataset Splits | Yes | Other dataset, network, and training details are in Appendix C. For CIFAR-10/100, Tiny ImageNet, CUB, and Stanford Dogs, we use the standard training/validation/test splits. For the Google Speech Commands dataset, we use the standard training/testing split with 80% and 20% respectively, and validate on 5% of the training data. |
| Hardware Specification | Yes | We run all experiments on GPU servers (NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 2080 Ti). |
| Software Dependencies | Yes | Our experiments are implemented with PyTorch 1.7.0 and MindSpore 1.3.0. |
| Experiment Setup | Yes | Except that Google Speech Commands takes 50 epochs, we train networks on the other datasets for 240 epochs. We use the SGD optimizer with 0.9 momentum. For VGG, AlexNet, ResNet, WideResNet, and ResNeXt, we set the learning rate to 0.05 (recommended by [44]). For ShuffleNet and MobileNet, we use a smaller learning rate of 0.01 (recommended by [44]). We use the pre-trained models provided in PyTorch for CUB and Stanford Dogs, and correspondingly, their learning rates are scaled by 0.1. During training, we decay the learning rate by 0.1 every 30 epochs after the first 150 epochs (recommended by [44]). For Google Speech Commands, we decay the learning rate via cosine annealing. We set the batch size to 128 for CIFAR data and 64 for the other datasets. (A configuration sketch reproducing these hyperparameters follows the table.) |
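
For reference, below is a minimal PyTorch sketch of the optimizer and learning-rate schedule implied by the reported experiment setup. The `model`, `dataset_name`, and `network_family` arguments are placeholders introduced here for illustration, not names from the authors' code; only the numeric hyperparameters (learning rates, momentum, decay milestones, epoch counts, batch sizes) are taken from the table above.

```python
# Sketch of the reported training configuration (assumed helper, not the authors' code).
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR, CosineAnnealingLR


def build_training_config(model, dataset_name, network_family):
    """Return (optimizer, scheduler, epochs, batch_size) following the reported setup."""
    # Base learning rate: 0.05 for VGG/AlexNet/ResNet/WideResNet/ResNeXt,
    # 0.01 for ShuffleNet/MobileNet; fine-tuning on CUB / Stanford Dogs scales it by 0.1.
    base_lr = 0.01 if network_family in {"shufflenet", "mobilenet"} else 0.05
    if dataset_name in {"cub", "stanford_dogs"}:
        base_lr *= 0.1

    optimizer = SGD(model.parameters(), lr=base_lr, momentum=0.9)

    if dataset_name == "speech_commands":
        # Google Speech Commands: 50 epochs with a cosine-annealed learning rate.
        epochs = 50
        scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    else:
        # Other datasets: 240 epochs, decay by 0.1 every 30 epochs after the first 150.
        epochs = 240
        scheduler = MultiStepLR(optimizer, milestones=[150, 180, 210], gamma=0.1)

    # Batch size: 128 for CIFAR-10/100, 64 for the other datasets.
    batch_size = 128 if dataset_name.startswith("cifar") else 64
    return optimizer, scheduler, epochs, batch_size
```

Stepping the returned scheduler once per epoch, after the optimizer updates, reproduces the stated decay points under these assumptions.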