Memory-Efficient Pipeline-Parallel DNN Training

Authors: Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we show that the Adam optimizer with 2BW has similar semantics to vanilla Adam, and that PipeDream-2BW and PipeDream-Flush are able to train large models faster than existing model-parallel approaches including Megatron (Shoeybi et al., 2019), and existing pipelining approaches like GPipe (Huang et al., 2019). (Section 5, Evaluation; a toy sketch of 2BW's delayed update rule follows the table.)
Researcher Affiliation | Collaboration | Deepak Narayanan (Stanford University), Amar Phanishayee (Microsoft Research), Kaiyu Shi (Microsoft), Xie Chen (Microsoft), Matei Zaharia (Stanford University). Correspondence to: Deepak Narayanan <deepakn@cs.stanford.edu>.
Pseudocode | Yes | The full algorithm is shown in Appendix A.
Open Source Code | No | Our implementation uses PyTorch and is adapted from the Megatron repository (meg); we verified that single-worker performance with this implementation achieves about 45 TFLOPS on a 355M-parameter GPT model and is competitive with existing state-of-the-art open source implementations from NVIDIA (nvi). The paper states that it adapted existing open-source code but provides no link to, or explicit statement about releasing, its own implementation.
Open Datasets | Yes | We use the OpenWebText dataset (ope) for pretraining. OpenWebText Dataset: https://github.com/jcpeterson/openwebtext.
Dataset Splits | No | Figure 5 shows the training and validation loss for the two models. To further validate the quality of the pre-trained model, we finetuned the pre-trained vanilla and 2BW BERT models on downstream MNLI and RACE tasks (Wang et al., 2019; Lai et al., 2017). While validation is performed and results are shown, the paper does not give specific details on the dataset splits (e.g., percentages or counts for the training, validation, and test sets, or how the splits were constructed).
Hardware Specification | Yes | Hardware. We show results on two different hardware setups on AWS: eight 8xV100 servers (64 GPUs) with NVLink and 16GB of per-GPU memory, and a single 8xV100 server. We use p3.16xlarge instances.
Software Dependencies | No | Our implementation uses PyTorch and is adapted from the Megatron repository (meg); we verified that single-worker performance with this implementation achieves about 45 TFLOPS on a 355M-parameter GPT model and is competitive with existing state-of-the-art open source implementations from NVIDIA (nvi). The paper names the software it builds on but does not provide specific version numbers for PyTorch or other libraries.
Experiment Setup | Yes | To provide a fair comparison, we use the same hyperparameters, including batch size, used by Megatron (Shoeybi et al., 2019) to train these BERT and GPT models. For BERT, we use a batch size of 1024, and for GPT, we use a batch size of 512. We use the Adam optimizer with standard hyperparameters (learning rate of 10^-4 with initial warmup and subsequent linear decay, maximum sequence length of 512), and mixed precision. (A minimal sketch of this optimizer configuration follows the table.)
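
To make the "similar semantics to vanilla Adam" claim in the Research Type row concrete, the toy sketch below contrasts a vanilla update with a 2BW-style update, assuming (as described in the paper) that 2BW applies gradients computed on weights that are one version stale, i.e. W(t+1) = W(t) - lr * grad(W(t-1)). Plain gradient descent on a toy quadratic is used instead of Adam purely for clarity; this is an illustration, not the authors' implementation.

```python
# Toy illustration (not the authors' code): vanilla updates vs. 2BW-style
# delayed updates, W(t+1) = W(t) - lr * grad(W(t-1)), on f(w) = 0.5 * w^2.

def grad(w):
    return w  # gradient of the toy quadratic loss f(w) = 0.5 * w^2

lr = 0.1
w_vanilla = 1.0
w_2bw, w_2bw_prev = 1.0, 1.0  # first 2BW update uses the initial weights; delay of 1 thereafter

for _ in range(100):
    w_vanilla -= lr * grad(w_vanilla)                         # gradient on current weights
    w_2bw, w_2bw_prev = w_2bw - lr * grad(w_2bw_prev), w_2bw  # gradient on stale weights

print(w_vanilla, w_2bw)  # both converge toward the optimum at 0
```

Both trajectories converge to the same optimum; the delayed variant simply lags by one weight version, which is the intuition behind the paper's claim that 2BW preserves optimizer semantics.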
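
The optimizer configuration quoted in the Experiment Setup row (Adam, peak learning rate of 10^-4, linear warmup followed by linear decay) can be expressed as a short PyTorch sketch. This is a minimal illustration, not the authors' code: model, warmup_steps, and total_steps are placeholder assumptions, and mixed-precision training (which the paper also uses) is omitted for brevity.

```python
# Minimal sketch (model, warmup_steps, total_steps are placeholder assumptions)
# of the quoted optimizer setup: Adam at a peak LR of 1e-4 with linear warmup
# followed by linear decay.
import torch

def build_optimizer(model, warmup_steps, total_steps):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def lr_lambda(step):
        if step < warmup_steps:  # linear warmup to the peak LR
            return step / max(1, warmup_steps)
        # linear decay from the peak LR down to zero at total_steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Usage: call scheduler.step() once after each optimizer.step().
```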