Mixed Precision Training
Authors: Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have run experiments for a variety of deep learning tasks covering a wide range of deep learning models. We conducted the following experiments for each application: |
| Researcher Affiliation | Industry | Sharan Narang, Gregory Diamos, Erich Elsen Baidu Research {sharan, gdiamos}@baidu.com Paulius Micikevicius, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu NVIDIA {pauliusm, alben, dagarcia, bginsburg, mhouston, okuchaiev, gavenkatesh, skyw}@nvidia.com |
| Pseudocode | No | The paper includes a diagram (Figure 1) illustrating the mixed precision training iteration but does not provide structured pseudocode or algorithm blocks (a minimal sketch of that iteration is given after this table). |
| Open Source Code | No | Footnote 1 points to "https://github.com/baidu-research/DeepBench", which is a benchmark framework and not the source code for the mixed precision training methodology described in the paper. |
| Open Datasets | Yes | We trained several CNNs for ILSVRC classification task (Russakovsky et al., 2015) using mixed precision: Alexnet, VGG-D, GoogLeNet, Inception v2, Inception v3, and pre-activation Resnet-50. ...trained on WMT15 dataset. ...We trained English language model, designated as bigLSTM (Jozefowicz et al., 2016), on the 1 billion word dataset. ...CelebFaces dataset (Liu et al., 2015b). |
| Dataset Splits | Yes | Top-1 accuracy on ILSVRC validation set are shown in Table 1. ...English results are reported on the WSJ '92 test set. |
| Hardware Specification | Yes | The baseline experiments were conducted on NVIDIA's Maxwell or Pascal GPUs. Mixed precision experiments were conducted on Volta V100, which accumulates FP16 products into FP32. |
| Software Dependencies | No | The paper mentions using the "Caffe (Jia et al., 2014) framework" and "PyTorch (Paszke et al., 2017)" but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | Loss-scaling is used for some applications. ...All hyper-parameters such as learning rate, annealing schedule and momentum were the same for baseline and pseudo FP16 experiments. ...trained for 20 epochs using Nesterov Stochastic Gradient Descent (SGD). ...The model consists of two layers of 8192 LSTM cells with projection to a 1024-dimensional embedding. This model was trained for 50 epochs using the Adagrad optimizer. ...Batch size aggregated over 4 GPUs is 1024. ...Adam optimizer was used to train for 100K iterations. |
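
Since the paper describes its training iteration only with a diagram (Figure 1) and prose, the following is a minimal PyTorch sketch of that iteration as summarized above: an FP32 master copy of the weights, an FP16 working copy for the forward and backward passes, and a constant loss-scaling factor. This is not the authors' implementation; the model, optimizer settings, and the scale value of 128 are illustrative assumptions, and a CUDA GPU is assumed.

```python
import torch
import torch.nn as nn

# Assumes a CUDA device; FP16 compute on CPU may be unsupported or slow.
device = torch.device("cuda")

# FP32 master copy of the weights, updated by the optimizer.
master_model = nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(master_model.parameters(), lr=0.01, momentum=0.9)

# FP16 working copy used for the forward and backward passes.
fp16_model = nn.Linear(1024, 10).to(device).half()

loss_scale = 128.0  # constant loss-scaling factor (illustrative value)
criterion = nn.CrossEntropyLoss()


def train_step(x, y):
    # 1. Refresh the FP16 working copy from the FP32 master weights.
    with torch.no_grad():
        for p16, p32 in zip(fp16_model.parameters(), master_model.parameters()):
            p16.copy_(p32)

    # 2. FP16 forward pass; compute the loss in FP32 and scale it before backward
    #    so that small gradient values stay representable in FP16.
    logits = fp16_model(x.half())
    loss = criterion(logits.float(), y)
    (loss * loss_scale).backward()

    # 3. Copy the FP16 gradients into FP32, unscale them, and update the
    #    FP32 master weights.
    optimizer.zero_grad()
    for p16, p32 in zip(fp16_model.parameters(), master_model.parameters()):
        p32.grad = p16.grad.float() / loss_scale
        p16.grad = None  # clear FP16 grads so they do not accumulate
    optimizer.step()
    return loss.item()


# Example usage with random data (shapes are arbitrary).
x = torch.randn(32, 1024, device=device)
y = torch.randint(0, 10, (32,), device=device)
print(train_step(x, y))
```

The separation into a master copy and a working copy mirrors the loop in Figure 1 of the paper; the loss-scaling step corresponds to the "Loss-scaling is used for some applications" note in the row above, with the scale chosen here purely for illustration.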