Piper: Multidimensional Planner for DNN Parallelization
Authors: Jakub M. Tarnawski, Deepak Narayanan, Amar Phanishayee
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 5, we evaluate Piper on real-world DNN profiles, and study the effects of combining the various parallelism modes and memory-saving optimizations on performance. Results of our evaluation in terms of the quality (TPS) of the obtained configurations are given in Figs. 1 and 2. |
| Researcher Affiliation | Industry | Jakub Tarnawski, Microsoft Research, jakub.tarnawski@microsoft.com; Deepak Narayanan, Microsoft Research, dnarayanan@microsoft.com; Amar Phanishayee, Microsoft Research, amar@microsoft.com |
| Pseudocode | No | The paper describes its algorithm in prose and mathematical notation within Section 4 'Algorithm' but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block or figure. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | No | The paper evaluates on a 'BERT-32 model' which is a specific model architecture they used for evaluation, not a publicly available dataset like ImageNet or CIFAR-10. No concrete access information (link, DOI, specific citation for a dataset) is provided. |
| Dataset Splits | No | The paper discusses model training and configuration but does not provide specific details regarding training, validation, and test dataset splits needed for data partitioning or reproducibility. |
| Hardware Specification | Yes | These TMPCs are obtained by profiling models implemented in PyTorch on NVidia A100 GPUs interconnected with a 300 GB/s bandwidth NVSwitch within a server, and 25 GB/s across servers. Training times were measured on a system with 8 NVidia DGX A100 machines, each with 8 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch' as the framework used for implementing models, but it does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | For our comparisons, we use a BERT-32 model, which consists of an embedding layer, 32 transformer layers and a pooling layer. We provide TMPCs for non-tensor-parallelized (t = 1) and tensor-parallelized executions of transformer layers [22] (t ∈ {2, 4, 8}), each with and without activation recomputation. Furthermore, Piper is given: the number of devices (K), available memory per device (M), the network bandwidth (B), and the target number of microbatches in a batch (N). N is the ratio of the chosen batch size (usually the maximum that is safe for convergence, e.g., 1024-2048 for large transformer-based LMs) to the provided microbatch size. (See the illustrative input sketch below.) |
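
To make the quoted experiment-setup inputs concrete, the following is a minimal Python sketch, not taken from the paper or any released code, of how Piper's inputs K, M, B, and N might be assembled. All names (`PiperInputs`, `microbatches_per_batch`, the field names) and the specific memory and batch-size values are illustrative assumptions; only the device count (8 DGX A100 servers × 8 GPUs) and the 25 GB/s cross-server bandwidth come from the quoted hardware description.

```python
from dataclasses import dataclass


@dataclass
class PiperInputs:
    """Hypothetical container for the inputs the paper says Piper is given."""
    num_devices: int              # K: number of accelerators available
    memory_per_device_gb: float   # M: available memory per device
    bandwidth_gb_per_s: float     # B: network bandwidth between devices
    num_microbatches: int         # N: microbatches per batch


def microbatches_per_batch(batch_size: int, microbatch_size: int) -> int:
    """N is the ratio of the chosen batch size to the provided microbatch size."""
    if batch_size % microbatch_size != 0:
        raise ValueError("batch size should be a multiple of the microbatch size")
    return batch_size // microbatch_size


# Example loosely matching the quoted setup: 8 DGX A100 servers with 8 GPUs each,
# cross-server bandwidth of 25 GB/s, and a large-transformer batch size in the
# 1024-2048 range. Memory and microbatch size are assumptions, not from the paper.
inputs = PiperInputs(
    num_devices=8 * 8,
    memory_per_device_gb=40.0,
    bandwidth_gb_per_s=25.0,
    num_microbatches=microbatches_per_batch(batch_size=2048, microbatch_size=8),
)
print(inputs)
```

With these illustrative numbers, N = 2048 / 8 = 256 microbatches per batch; the planner itself (how Piper searches over data-, tensor-, and pipeline-parallel configurations under these inputs) is described in Section 4 of the paper and is not reproduced here.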