Reducing Underflow in Mixed Precision Training by Gradient Scaling

Authors: Ruizhe Zhao, Brian Vogel, Tanvir Ahmed, Wayne Luk

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a variety of networks and tasks show that gradient scaling can improve accuracy and reduce overall training effort compared with the state-of-the-art MPT.
Researcher Affiliation | Collaboration | Ruizhe Zhao¹, Brian Vogel², Tanvir Ahmed² and Wayne Luk¹ (¹Imperial College London, ²Preferred Networks, Inc.)
Pseudocode | Yes | Algorithm 1: The gradient scaling algorithm. (A hedged sketch of the underlying idea appears after this table.)
Open Source Code | No | The paper states that gradient scaling was implemented in Chainer and that ChainerCV was used, and it links to general ChainerCV examples (e.g., https://github.com/chainer/chainercv/blob/master/examples/ssd/), but it does not provide a link to open-source code for the authors' own gradient scaling implementation.
Open Datasets | Yes | We first examine ResNet-18 and ResNet-50 [He et al., 2016] on the ILSVRC 2012 dataset [Russakovsky et al., 2015]. ... SSD is trained on a joint of the PASCAL VOC 2007 and 2012 training sets... SegNet is evaluated on the CamVid road scenes dataset...
Dataset Splits | Yes | SegNet is evaluated on the CamVid road scenes dataset, which has 367 training and 233 test images at the resolution of 360 × 480.
Hardware Specification | Yes | Most of our experiments run on NVIDIA RTX 2080Ti that has the latest Turing architecture [Choquette et al., 2018] and enables Tensor Core operations. We also have limited access to a multi-GPU cluster that has several NVIDIA Tesla V100 installed to run distributed training.
Software Dependencies | Yes | We implemented gradient scaling in the Chainer [Tokui et al., 2019] framework (version 6.3.0). ... The CUDA library (version 10.1) we use supports FP16 and Tensor Core [Gupta, 2019].
Experiment Setup | Yes | The only hyperparameter for gradient scaling is the underflow rate threshold γ, which is set as 10⁻³ for all experiments. ... We train them in the same hyperparameter setting as [Goyal et al., 2017] with 16 GPUs. ... when training it, we use SGD with batch size 12, learning rate 0.1, and momentum 0.9 for 16K iterations. (A hedged configuration sketch follows the table.)
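
The Pseudocode row refers to the paper's Algorithm 1, per-layer gradient scaling driven by the underflow-rate threshold γ. Below is a minimal NumPy sketch of the underlying idea, not a transcription of Algorithm 1: the helpers `underflow_rate` and `choose_scale`, the power-of-two search, and the FP16 subnormal cutoff are illustrative assumptions.

```python
import numpy as np

GAMMA = 1e-3            # underflow-rate threshold gamma; the paper sets it to 1e-3 for all experiments
FP16_TINY = 2.0 ** -24  # smallest positive (subnormal) magnitude representable in FP16

def underflow_rate(grad_fp32):
    """Fraction of non-zero entries below the smallest FP16 subnormal (a proxy for FP16 underflow)."""
    nonzero = grad_fp32 != 0
    if not nonzero.any():
        return 0.0
    return float(np.mean(np.abs(grad_fp32[nonzero]) < FP16_TINY))

def choose_scale(grad_fp32, gamma=GAMMA, max_scale=2.0 ** 24):
    """Grow a power-of-two scale until the scaled gradient underflows less often than gamma."""
    scale = 1.0
    while underflow_rate(grad_fp32 * scale) > gamma and scale < max_scale:
        scale *= 2.0
    return scale

# Scale before casting to FP16; divide the scale back out in FP32 before the weight update.
grad = np.random.randn(1024).astype(np.float32) * 1e-8  # tiny gradients prone to FP16 underflow
s = choose_scale(grad)
grad_fp16 = (grad * s).astype(np.float16)       # half-precision copy used for storage/communication
grad_master = grad_fp16.astype(np.float32) / s  # unscaled gradient for the FP32 master update
```

Keeping the scale a power of two means scaling and unscaling are exact in binary floating point, a common design choice in loss- and gradient-scaling schemes.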
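
The Software Dependencies and Experiment Setup rows pin the stack to Chainer 6.3.0 with CUDA 10.1 and quote the optimiser settings (SGD, batch size 12, learning rate 0.1, momentum 0.9, 16K iterations). The sketch below shows one plausible Chainer-style configuration under those settings; the `chainer.mixed16` dtype switch and the placeholder model are assumptions, not code from the paper.

```python
import chainer
from chainer import links, optimizers

# Assumption: Chainer 6.x's mixed-precision mode, which keeps FP32 master weights
# while running forward/backward computation in FP16.
chainer.global_config.dtype = chainer.mixed16

BATCH_SIZE = 12          # quoted batch size
MAX_ITERATIONS = 16000   # quoted training length (16K iterations)
GAMMA = 1e-3             # underflow-rate threshold, the only extra hyperparameter of gradient scaling

model = links.Linear(None, 10)  # tiny placeholder standing in for the real network from the paper
optimizer = optimizers.MomentumSGD(lr=0.1, momentum=0.9)
optimizer.setup(model)

# The paper's gradient scaling (Algorithm 1) would be applied per layer between the
# backward pass and the parameter update, along the lines of the NumPy sketch above.
```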