Reducing Underflow in Mixed Precision Training by Gradient Scaling

Authors: Ruizhe Zhao, Brian Vogel, Tanvir Ahmed, Wayne Luk

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a variety of networks and tasks show that gradient scaling can improve accuracy and reduce overall training effort compared with the state-of-the-art MPT.
Researcher Affiliation | Collaboration | Ruizhe Zhao¹, Brian Vogel², Tanvir Ahmed² and Wayne Luk¹ (¹Imperial College London, ²Preferred Networks, Inc.)
Pseudocode | Yes | Algorithm 1: The gradient scaling algorithm. (A hedged sketch of the underlying idea appears after this table.)
Open Source Code | No | The paper states that gradient scaling was implemented in Chainer and that ChainerCV was used, and it links to general ChainerCV examples (e.g., https://github.com/chainer/chainercv/blob/master/examples/ssd/), but it does not provide a link to open-source code for the authors' own gradient scaling implementation.
Open Datasets | Yes | We first examine ResNet-18 and ResNet-50 [He et al., 2016] on the ILSVRC 2012 dataset [Russakovsky et al., 2015]. ... SSD is trained on a joint of the PASCAL VOC 2007 and 2012 training sets... SegNet is evaluated on the CamVid road scenes dataset...
Dataset Splits | Yes | SegNet is evaluated on the CamVid road scenes dataset, which has 367 training and 233 test images at the resolution of 360 × 480.
Hardware Specification | Yes | Most of our experiments run on NVIDIA RTX 2080Ti that has the latest Turing architecture [Choquette et al., 2018] and enables Tensor Core operations. We also have limited access to a multi-GPU cluster that has several NVIDIA Tesla V100 installed to run distributed training.
Software Dependencies | Yes | We implemented gradient scaling in the Chainer [Tokui et al., 2019] framework (version 6.3.0). ... The CUDA library (version 10.1) we use supports FP16 and Tensor Core [Gupta, 2019].
Experiment Setup | Yes | The only hyperparameter for gradient scaling is the underflow rate threshold γ, which is set as 10⁻³ for all experiments. ... We train them in the same hyperparameter setting as [Goyal et al., 2017] with 16 GPUs. ... when training it, we use SGD with batch size 12, learning rate 0.1, and momentum 0.9 for 16K iterations. (A hedged configuration sketch follows the table.)
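
The Pseudocode row refers to the paper's Algorithm 1, per-layer gradient scaling driven by the underflow-rate threshold γ. Below is a minimal NumPy sketch of the underlying idea, not a transcription of Algorithm 1: the helpers `underflow_rate` and `choose_scale`, the power-of-two search, and the FP16 subnormal cutoff are illustrative assumptions.

```python
import numpy as np

GAMMA = 1e-3            # underflow-rate threshold gamma; the paper sets it to 1e-3 for all experiments
FP16_TINY = 2.0 ** -24  # smallest positive (subnormal) magnitude representable in FP16

def underflow_rate(grad_fp32):
    """Fraction of non-zero entries below the smallest FP16 subnormal (a proxy for FP16 underflow)."""
    nonzero = grad_fp32 != 0
    if not nonzero.any():
        return 0.0
    return float(np.mean(np.abs(grad_fp32[nonzero]) < FP16_TINY))

def choose_scale(grad_fp32, gamma=GAMMA, max_scale=2.0 ** 24):
    """Grow a power-of-two scale until the scaled gradient underflows less often than gamma."""
    scale = 1.0
    while underflow_rate(grad_fp32 * scale) > gamma and scale < max_scale:
        scale *= 2.0
    return scale

# Scale before casting to FP16; divide the scale back out in FP32 before the weight update.
grad = np.random.randn(1024).astype(np.float32) * 1e-8  # tiny gradients prone to FP16 underflow
s = choose_scale(grad)
grad_fp16 = (grad * s).astype(np.float16)       # half-precision copy used for storage/communication
grad_master = grad_fp16.astype(np.float32) / s  # unscaled gradient for the FP32 master update
```

Keeping the scale a power of two means scaling and unscaling are exact in binary floating point, a common design choice in loss- and gradient-scaling schemes.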
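
The Software Dependencies and Experiment Setup rows pin the stack to Chainer 6.3.0 with CUDA 10.1 and quote the optimiser settings (SGD, batch size 12, learning rate 0.1, momentum 0.9, 16K iterations). The sketch below shows one plausible Chainer-style configuration under those settings; the `chainer.mixed16` dtype switch and the placeholder model are assumptions, not code from the paper.

```python
import chainer
from chainer import links, optimizers

# Assumption: Chainer 6.x's mixed-precision mode, which keeps FP32 master weights
# while running forward/backward computation in FP16.
chainer.global_config.dtype = chainer.mixed16

BATCH_SIZE = 12          # quoted batch size
MAX_ITERATIONS = 16000   # quoted training length (16K iterations)
GAMMA = 1e-3             # underflow-rate threshold, the only extra hyperparameter of gradient scaling

model = links.Linear(None, 10)  # tiny placeholder standing in for the real network from the paper
optimizer = optimizers.MomentumSGD(lr=0.1, momentum=0.9)
optimizer.setup(model)

# The paper's gradient scaling (Algorithm 1) would be applied per layer between the
# backward pass and the parameter update, along the lines of the NumPy sketch above.
```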