Reducing Underflow in Mixed Precision Training by Gradient Scaling
Authors: Ruizhe Zhao, Brian Vogel, Tanvir Ahmed, Wayne Luk
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on a variety of networks and tasks show that gradient scaling can improve accuracy and reduce overall training effort compared with the state-of-the-art MPT. |
| Researcher Affiliation | Collaboration | Ruizhe Zhao¹, Brian Vogel², Tanvir Ahmed² and Wayne Luk¹ (¹Imperial College London, ²Preferred Networks, Inc.) |
| Pseudocode | Yes | Algorithm 1: The gradient scaling algorithm. |
| Open Source Code | No | The paper states it implemented gradient scaling in Chainer and used ChainerCV, providing links to general ChainerCV examples (e.g., 'https://github.com/chainer/chainercv/blob/master/examples/ssd/'), but does not explicitly provide a link to the open-source code for *their specific implementation* of gradient scaling discussed in the paper. |
| Open Datasets | Yes | We first examine ResNet-18 and ResNet-50 [He et al., 2016] on the ILSVRC 2012 dataset [Russakovsky et al., 2015]. ... SSD is trained on a joint of the PASCAL VOC 2007 and 2012 training sets... SegNet is evaluated on the CamVid road scenes dataset... |
| Dataset Splits | Yes | SegNet is evaluated on the CamVid road scenes dataset, which has 367 training and 233 test images at the resolution of 360 × 480. |
| Hardware Specification | Yes | Most of our experiments run on NVIDIA RTX 2080Ti that has the latest Turing architecture [Choquette et al., 2018] and enables Tensor Core operations. We also have limited access to a multi-GPU cluster that has several NVIDIA Tesla V100 installed to run distributed training. |
| Software Dependencies | Yes | We implemented gradient scaling in the Chainer [Tokui et al., 2019] framework (version 6.3.0). ... The CUDA library (version 10.1) we use supports FP16 and Tensor Core [Gupta, 2019]. |
| Experiment Setup | Yes | The only hyperparameter for gradient scaling is the underflow rate threshold γ, which is set to 10⁻³ for all experiments. ... We train them in the same hyperparameter setting as [Goyal et al., 2017] with 16 GPUs. ... when training it, we use SGD with batch size 12, learning rate 0.1, and momentum 0.9 for 16K iterations. (A hedged sketch of underflow-rate-driven gradient scaling follows the table.) |
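
Since the paper's Algorithm 1 is not open-sourced, the snippet below is only a minimal NumPy sketch of the idea quoted in the table: measure how many gradient entries would underflow when cast to FP16 and grow a power-of-two scale until that rate falls below the threshold γ = 10⁻³. The function names, the 2**-24 underflow cutoff, the doubling loop, and the cap on doublings are illustrative assumptions, not the authors' Algorithm 1 or their Chainer implementation.

```python
import numpy as np

# Underflow-rate threshold gamma = 1e-3, the value quoted in the Experiment Setup row.
GAMMA = 1e-3
# Assumption: FP32 values whose magnitude is below the least positive subnormal
# FP16 number (2**-24) are counted as underflowing when cast to FP16.
FP16_TINY = 2.0 ** -24

def underflow_rate(grad):
    """Fraction of nonzero gradient entries whose magnitude would vanish in FP16."""
    nonzero = grad != 0.0
    if not nonzero.any():
        return 0.0
    return float(((np.abs(grad) < FP16_TINY) & nonzero).sum()) / float(nonzero.sum())

def scale_gradient(grad_fp32, scale=1.0, max_doublings=32):
    """Grow a power-of-two scale until the underflow rate drops below GAMMA,
    then cast the scaled gradient to FP16. Returns the FP16 gradient and the
    scale so it can be divided out (in FP32) before the weight update."""
    for _ in range(max_doublings):
        if underflow_rate(grad_fp32 * scale) <= GAMMA:
            break
        scale *= 2.0
    return (grad_fp32 * scale).astype(np.float16), scale

# Toy usage: a synthetic gradient whose entries mostly sit below the FP16 range.
rng = np.random.default_rng(0)
grad = rng.normal(scale=1e-8, size=10_000).astype(np.float32)
grad_fp16, scale = scale_gradient(grad)
print(scale, underflow_rate(grad * scale))
```

In a full training loop the returned scale would be divided out in FP32 before the optimizer update, analogous to loss scaling in standard mixed precision training.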