Efficient Algorithms for Device Placement of DNN Graph Operators

Authors: Jakub M. Tarnawski, Amar Phanishayee, Nikhil Devanur, Divya Mahajan, Fanny Nina Paravecino

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the applicability and efficiency of our approaches using several contemporary DNN computation graphs. We evaluate our partitioning algorithms for the different scenarios described above on a variety of modern DNN workloads (7 DNNs, 16 layer and operator graphs). We find that the placements are efficient and result in non-trivial optimal splits; non-contiguous splits outperform all the techniques, with an improvement of up to 2× over expert (average 1.46×), 2.08× over local search (average 1.29×) [MKA07], 1.21× over PipeDream (average 1.10×) [NHP+19], and 7.69× over Scotch (average 1.50×) [Pel09]. (Illustrative sketches of the placement objective follow the table.)
Researcher Affiliation | Industry | Jakub Tarnawski (Microsoft Research), Amar Phanishayee (Microsoft Research), Nikhil Devanur (Amazon), Divya Mahajan (Microsoft), Fanny Nina Paravecino (Microsoft)
Pseudocode | No | The paper describes algorithms using mathematical formulations and prose, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and workloads used for evaluations are available at https://github.com/msr-fiddle/dnn-partitioning.
Open Datasets | No | The paper mentions using 'BERT', 'ResNet50', 'Inception-v3', and 'GNMT' as DNN models/workloads, but does not provide concrete access information (links, citations with author/year, or repository names) for the specific datasets used in their evaluation.
Dataset Splits | No | The paper discusses splitting DNN models across accelerators for parallelism, but it does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for reproducing the data partitioning.
Hardware Specification | Yes | The DNN workloads are split across 6 accelerators of the same type (GPU for layer graphs, a hardware accelerator representing TPUs or FPGAs for operator graphs). We use 3 accelerators in case of the smaller BERT-3 and BERT-6 models. Each accelerator has 16 GB of DRAM and is connected to the CPU over a PCIe 3.0 interconnect.
Software Dependencies | No | The paper mentions using a 'commercial-grade solver [GO19]' (the Gurobi optimizer), but does not provide specific version numbers for this or any other software component used in the experiments.
Experiment Setup | Yes | More details about our experimental setup, graph topology, and implementations can be found in Appendix E.
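
For context on the "splits" compared in the results quoted above, the sketch below scores a candidate device placement of an operator graph by its per-device bottleneck (compute load plus cross-device traffic), a common proxy for pipelined throughput. This is a minimal illustration only: the graph, the costs, the ~12 GB/s bandwidth constant, and the exact objective are assumptions for the example, not the paper's ILP/DP formulation.

```python
# Minimal sketch (not the paper's formulation): score a candidate device
# placement of a DNN operator graph by the load of its most loaded device,
# a common proxy for pipelined throughput. All costs and names are illustrative.
from collections import defaultdict

def bottleneck(compute_cost, edges, placement):
    """compute_cost: {op: seconds}, edges: [(src, dst, bytes)],
    placement: {op: device id}. Returns the most loaded device's total cost."""
    load = defaultdict(float)
    for op, seconds in compute_cost.items():
        load[placement[op]] += seconds
    for src, dst, nbytes in edges:
        if placement[src] != placement[dst]:
            # Charge cross-device traffic to the sending device; the divisor is
            # an assumed ~12 GB/s effective PCIe 3.0 x16 bandwidth.
            load[placement[src]] += nbytes / 12e9
    return max(load.values())

# Tiny chain a -> b -> c -> d on 2 devices: an unbalanced vs. a balanced split.
costs = {"a": 3.0, "b": 1.0, "c": 1.0, "d": 3.0}
edges = [("a", "b", 1e6), ("b", "c", 1e6), ("c", "d", 1e6)]
print(bottleneck(costs, edges, {"a": 0, "b": 1, "c": 1, "d": 1}))  # -> 5.0
print(bottleneck(costs, edges, {"a": 0, "b": 0, "c": 1, "d": 1}))  # -> ~4.0001
```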
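As a further illustration of the contiguous baseline that the non-contiguous placements are compared against, here is a textbook dynamic program for splitting a linear chain of layers into contiguous stages so that the heaviest stage is as light as possible. It is a generic sketch under simplified assumptions (no communication or memory terms) and is not the algorithm from the paper.

```python
# Textbook sketch, not the paper's algorithm: split a chain of layers into k
# contiguous stages, minimizing the cost of the heaviest stage. Communication
# and memory constraints are deliberately ignored here.
from itertools import accumulate

def best_contiguous_split(costs, k):
    """costs: per-layer compute costs in chain order (requires k <= len(costs)).
    Returns (bottleneck, cut_points), where cut_points are the starting layer
    indices of stages 2..k."""
    n = len(costs)
    prefix = [0.0] + list(accumulate(costs))       # prefix[i] = total cost of layers 0..i-1
    INF = float("inf")
    dp  = [[INF] * (k + 1) for _ in range(n + 1)]  # dp[i][j]: best bottleneck placing the
    cut = [[0]   * (k + 1) for _ in range(n + 1)]  # first i layers on j devices
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            for s in range(j - 1, i):              # last stage covers layers s..i-1
                cand = max(dp[s][j - 1], prefix[i] - prefix[s])
                if cand < dp[i][j]:
                    dp[i][j], cut[i][j] = cand, s
    # Walk the cut table backwards to recover where each stage starts.
    starts, i = [], n
    for j in range(k, 0, -1):
        starts.append(cut[i][j])
        i = cut[i][j]
    return dp[n][k], starts[::-1][1:]              # drop stage 1's start (always 0)

# Example: 8 layers split across 3 devices; prints (7, [2, 5]),
# i.e. stages [4, 2], [1, 3, 3], [1, 2, 4] with bottleneck 7.
print(best_contiguous_split([4, 2, 1, 3, 3, 1, 2, 4], 3))
```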