Mixed Precision Training
Authors: Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have run experiments for a variety of deep learning tasks covering a wide range of deep learning models. We conducted the following experiments for each application: |
| Researcher Affiliation | Industry | Sharan Narang, Gregory Diamos, Erich Elsen Baidu Research {sharan, gdiamos}@baidu.com Paulius Micikevicius, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu NVIDIA {pauliusm, alben, dagarcia, bginsburg, mhouston, okuchaiev, gavenkatesh, skyw}@nvidia.com |
| Pseudocode | No | The paper includes a diagram (Figure 1) illustrating the mixed precision training iteration but does not provide structured pseudocode or algorithm blocks (a minimal sketch of that iteration is given after this table). |
| Open Source Code | No | Footnote 1 points to "https://github.com/baidu-research/DeepBench", which is a benchmark framework and not the source code for the mixed precision training methodology described in the paper. |
| Open Datasets | Yes | We trained several CNNs for ILSVRC classification task (Russakovsky et al., 2015) using mixed precision: Alexnet, VGG-D, GoogLeNet, Inception v2, Inception v3, and pre-activation Resnet-50. ...trained on WMT15 dataset. ...We trained English language model, designated as bigLSTM (Jozefowicz et al., 2016), on the 1 billion word dataset. ...CelebFaces dataset (Liu et al., 2015b). |
| Dataset Splits | Yes | Top-1 accuracy on ILSVRC validation set are shown in Table 1. ...English results are reported on the WSJ '92 test set. |
| Hardware Specification | Yes | The baseline experiments were conducted on NVIDIA's Maxwell or Pascal GPUs. Mixed precision experiments were conducted on Volta V100, which accumulates FP16 products into FP32. |
| Software Dependencies | No | The paper mentions using the "Caffe (Jia et al., 2014) framework" and "PyTorch (Paszke et al., 2017)" but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | Loss-scaling is used for some applications. ...All hyper-parameters such as learning rate, annealing schedule and momentum were the same for baseline and pseudo FP16 experiments. ...trained for 20 epochs using Nesterov Stochastic Gradient Descent (SGD). ...The model consists of two layers of 8192 LSTM cells with projection to a 1024-dimensional embedding. This model was trained for 50 epochs using the Adagrad optimizer. ...Batch size aggregated over 4 GPUs is 1024. ...Adam optimizer was used to train for 100K iterations. |
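
Since the paper describes its training iteration only with a diagram (Figure 1) and prose, the following is a minimal PyTorch sketch of that iteration as summarized above: an FP32 master copy of the weights, an FP16 working copy for the forward and backward passes, and a constant loss-scaling factor. This is not the authors' implementation; the model, optimizer settings, and the scale value of 128 are illustrative assumptions, and a CUDA GPU is assumed.

```python
import torch
import torch.nn as nn

# Assumes a CUDA device; FP16 compute on CPU may be unsupported or slow.
device = torch.device("cuda")

# FP32 master copy of the weights, updated by the optimizer.
master_model = nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(master_model.parameters(), lr=0.01, momentum=0.9)

# FP16 working copy used for the forward and backward passes.
fp16_model = nn.Linear(1024, 10).to(device).half()

loss_scale = 128.0  # constant loss-scaling factor (illustrative value)
criterion = nn.CrossEntropyLoss()


def train_step(x, y):
    # 1. Refresh the FP16 working copy from the FP32 master weights.
    with torch.no_grad():
        for p16, p32 in zip(fp16_model.parameters(), master_model.parameters()):
            p16.copy_(p32)

    # 2. FP16 forward pass; compute the loss in FP32 and scale it before backward
    #    so that small gradient values stay representable in FP16.
    logits = fp16_model(x.half())
    loss = criterion(logits.float(), y)
    (loss * loss_scale).backward()

    # 3. Copy the FP16 gradients into FP32, unscale them, and update the
    #    FP32 master weights.
    optimizer.zero_grad()
    for p16, p32 in zip(fp16_model.parameters(), master_model.parameters()):
        p32.grad = p16.grad.float() / loss_scale
        p16.grad = None  # clear FP16 grads so they do not accumulate
    optimizer.step()
    return loss.item()


# Example usage with random data (shapes are arbitrary).
x = torch.randn(32, 1024, device=device)
y = torch.randint(0, 10, (32,), device=device)
print(train_step(x, y))
```

The separation into a master copy and a working copy mirrors the loop in Figure 1 of the paper; the loss-scaling step corresponds to the "Loss-scaling is used for some applications" note in the row above, with the scale chosen here purely for illustration.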