GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012; (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models. |
| Researcher Affiliation | Industry | {huangyp,ylc,ankurbpn,orhanf,miachen,dehao,hyouklee,jngiam,qvl,yonghui,zhifengc}@google.com |
| Pseudocode | No | The paper describes the algorithm using text and a diagram (Figure 2c), but it does not include explicitly labeled pseudocode or an algorithm block in the main text. It refers to supplementary material for examples, which may contain pseudocode, but this is not present in the main paper. (A hedged sketch of the batch-splitting pipeline schedule appears after this table.) |
| Open Source Code | No | The paper states 'This open-source library is implemented under the Lingvo [16] framework.' This indicates that GPipe is built within an open-source framework, but it does not explicitly state that the specific code for GPipe as described and evaluated in this paper is publicly released or provide a link to it. |
| Open Datasets | Yes | ImageNet-2012 dataset; CIFAR-10, CIFAR-100, Stanford Cars, Oxford Pets, Food-101, FGVC Aircraft, Birdsnap (all in Table 5); We use a corpus of parallel documents over 102 languages and English, containing a total of 25 billion training examples... [37]. |
| Dataset Splits | Yes | We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012; 84.4% top-1 accuracy for the 550M parameter AmoebaNet model; top-1 validation accuracy of 84.4%. |
| Hardware Specification | Yes | We ran the experiments on Cloud TPUv2s with 8GB memory per accelerator. We next trained Transformer models using Cloud TPUv3s with 16GB memory per accelerator core. We ran our experiments on a single host with multiple NVIDIA P100 GPUs. |
| Software Dependencies | No | The paper states 'This open-source library is implemented under the Lingvo [16] framework.' While Lingvo is mentioned, no specific version numbers are provided for Lingvo or any other software dependencies. |
| Experiment Setup | Yes | We used a fixed input image size of 224×224 and mini-batch size of 128. We used a fixed vocabulary size of 32k, sequence length 1024 and batch size 32. Each Transformer layer has 2048 for model dimension, 8192 for feed-forward hidden dimension and 32 attention heads. The number of micro-batches was fixed at 32. Input images to the network during training were resized to 480×480, horizontally flipped randomly and augmented using cutout [24]. We clip the logit predictions (softmax pre-activations) whenever their magnitude exceeds a certain value. (These quoted values are collected into a configuration sketch after this table.) |
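
As the "Pseudocode" row notes, the paper conveys GPipe's batch-splitting pipeline through prose and Figure 2c rather than an algorithm block. The following is a minimal Python sketch of that schedule as we understand it from the paper, not the authors' code: a mini-batch is split into M micro-batches, and at clock tick t, pipeline stage k processes micro-batch t − k, so the idle "bubble" is confined to the first and last K − 1 ticks. All names (`split_minibatch`, `gpipe_forward_schedule`) are illustrative.

```python
from typing import List, Sequence


def split_minibatch(batch: Sequence, num_micro_batches: int) -> List[Sequence]:
    """Split a mini-batch into M roughly equal micro-batches (GPipe's first step)."""
    step = -(-len(batch) // num_micro_batches)  # ceiling division
    return [batch[i:i + step] for i in range(0, len(batch), step)]


def gpipe_forward_schedule(num_stages: int, num_micro_batches: int):
    """Yield (tick, stage, micro_batch) triples for the pipelined forward pass.

    At tick t, stage k works on micro-batch t - k; once the pipeline fills,
    all K stages run concurrently on different micro-batches. The idle
    (K - 1) ticks at either end are the "bubble" overhead the paper analyzes.
    """
    for t in range(num_stages + num_micro_batches - 1):
        for k in range(num_stages):
            m = t - k
            if 0 <= m < num_micro_batches:
                yield t, k, m


if __name__ == "__main__":
    # 4 pipeline stages, 8 micro-batches: print which micro-batch each
    # stage touches at every tick of the forward pass.
    for tick, stage, mb in gpipe_forward_schedule(num_stages=4, num_micro_batches=8):
        print(f"tick {tick:2d}: stage {stage} -> micro-batch {mb}")
```

In the paper's setup M = 32; gradients from all micro-batches are accumulated and applied in a single synchronous update at the end of each mini-batch, so micro-batching changes throughput rather than the gradient (modulo batch-normalization statistics, which the paper computes per micro-batch).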
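
The "Experiment Setup" row quotes the Transformer hyperparameters and the logit-clipping trick. Purely as a hedged illustration, the sketch below collects those quoted values into a config object and shows the clipping step; the class name, function name, and the clip threshold (the paper does not state its value) are our assumptions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TransformerSetup:
    # Values quoted from the paper's experiment setup.
    vocab_size: int = 32_000       # "fixed vocabulary size of 32k"
    seq_length: int = 1024         # "sequence length 1024"
    batch_size: int = 32           # "batch size 32"
    model_dim: int = 2048          # "2048 for model dimension"
    ffn_hidden_dim: int = 8192     # "8192 for feed-forward hidden dimension"
    num_heads: int = 32            # "32 attention heads"
    num_micro_batches: int = 32    # "number of micro-batches was fixed at 32"


def clip_logits(logits: np.ndarray, max_abs: float = 20.0) -> np.ndarray:
    """Clip softmax pre-activations whose magnitude exceeds a threshold.

    The paper reports clipping logits for training stability but not the
    threshold; max_abs=20.0 is a placeholder assumption.
    """
    return np.clip(logits, -max_abs, max_abs)
```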