THOR, Trace-based Hardware-driven Layer-Oriented Natural Gradient Descent Computation
Authors: Mengyun Chen, Kaixin Gao, Xiaolei Liu, Zidong Wang, Ningxi Ni, Qian Zhang, Lei Chen, Chao Ding, Zhenghai Huang, Min Wang, Shuangling Wang, Fan Yu, Xinyuan Zhao, Dachuan Xu
AAAI 2021, pp. 7046-7054
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of THOR, we have conducted extensive experiments. The results show that training ResNet-50 on ImageNet with THOR only takes 66.7 minutes to achieve a top-1 accuracy of 75.9% under an 8 Ascend 910 environment with MindSpore, a new deep learning computing framework. |
| Researcher Affiliation | Collaboration | 1 Huawei Technologies Co. Ltd, 2 Tianjin University, 3 Beijing University of Technology, 4 Hong Kong University of Science and Technology, 5 Chinese Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 THOR |
| Open Source Code | Yes | Furthermore, part of our algorithm has been open sourced¹, and the code will continue to be improved in the future. ¹THOR: https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor |
| Open Datasets | Yes | To test the performance, we apply THOR to train ResNet-18 for CIFAR-10 and ResNet-50 for ImageNet. |
| Dataset Splits | No | The paper states that ResNet-18 is trained on CIFAR-10 and ResNet-50 on ImageNet but does not explicitly specify the training/validation/test dataset splits using percentages, counts, or specific references to pre-defined validation splits. |
| Hardware Specification | Yes | The results show that training ResNet-50 on ImageNet with THOR only takes 66.7 minutes to achieve a top-1 accuracy of 75.9% under an 8 Ascend 910 environment with MindSpore... In this experiment, we use PyTorch on 1 Tesla V100... we implement THOR on MindSpore with 8 Ascend 910 |
| Software Dependencies | No | The paper mentions using 'MindSpore' and 'PyTorch' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | In this experiment, we use PyTorch on 1 Tesla V100 and train ResNet-18 on CIFAR-10 with batch-size 128. We set the same learning rate for Momentum, KFAC, THOR, THOR_stop and THOR_NT, and the same damping for KFAC, THOR, THOR_stop and THOR_NT. The learning rate α(e) for epoch e and the damping λ(e) are defined as follows: α(e) = 0.1 · 10^⌊e/70⌋, λ(e) = 0.3 · 10^⌊e/70⌋. The weight decay for Momentum, KFAC, THOR, THOR_stop and THOR_NT is set to 0.0005. The trace thresholds are set to (ω1, ω2) = (0.01, 0) for THOR, (ω1, ω2) = (0.01, 0.001) for THOR_stop and (ω1, ω2) = (0, 0) for THOR_NT. The update interval for KFAC is set to 20. ... In this experiment, we implement THOR on MindSpore with 8 Ascend 910 and train ResNet-50 on ImageNet with batch-size 256. The weight decay for these methods is set to 0.0005 and the label smoothing is set to 0.1. The trace thresholds are set to (ω1, ω2) = (0.01, 0) for THOR, (ω1, ω2) = (0.01, 0.001) for THOR_stop and (ω1, ω2) = (0, 0) for THOR_NT. Split dimension, learning rate, damping and update interval can be found in Figure 8. The learning rate α(e) for epoch e is determined as follows: α(e) = α_target · (1 − e/e_end)^p_decay. The damping λ adopts the following decreasing rule: λ(e) = λ(0) · ρ_decay^⌊e/10⌋. The hyper-parameters for our methods are shown in Table 3. (See the schedule sketch below the table.) |
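
The learning-rate and damping schedules quoted in the Experiment Setup row are compact enough to express directly. The sketch below is a minimal, hedged reconstruction of those formulas, not the authors' released code: the function and variable names are illustrative, the negative exponent in the step schedule is an assumption made so the CIFAR-10 values decay rather than grow, and the ImageNet constants (α_target, e_end, p_decay, the damping decay rate) must be taken from Table 3 of the paper.

```python
import math

# Hedged sketch of the schedules reported in the reproducibility table.
# Names and the sign of the step-schedule exponent are assumptions, not
# the authors' implementation.

def step_schedule(epoch, base, interval=70):
    """Step schedule base * 10^(-floor(epoch / interval)).

    The table quotes 0.1 * 10^floor(e/70) for the CIFAR-10 learning rate and
    0.3 * 10^floor(e/70) for the damping; a negative exponent is assumed here
    so that the value decreases as training proceeds.
    """
    return base * 10.0 ** (-math.floor(epoch / interval))

def poly_decay_lr(epoch, alpha_target, e_end, p_decay):
    """ImageNet learning rate: alpha_target * (1 - e / e_end)^p_decay."""
    return alpha_target * (1.0 - epoch / e_end) ** p_decay

# Example with the CIFAR-10 constants quoted above (lr 0.1, damping 0.3).
lr_at_0 = step_schedule(0, 0.1)       # 0.1
lr_at_75 = step_schedule(75, 0.1)     # 0.01 under the assumed decay reading
damping_at_0 = step_schedule(0, 0.3)  # 0.3
```

Under this reading, both CIFAR-10 quantities drop by a factor of 10 every 70 epochs, while the ImageNet learning rate follows a polynomial decay to zero at epoch e_end; the actual constants and the exact damping rule should be checked against Table 3 and Figure 8 of the paper.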