AutoGraph: Optimizing DNN Computation Graph for Parallel GPU Kernel Execution

Authors: Yuxuan Zhao, Qi Sun, Zhuolun He, Yang Bai, Bei Yu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our method achieves up to 3.47× speedup over existing graph optimization methods. Moreover, AutoGraph outperforms state-of-the-art parallel kernel launch frameworks by up to 1.26×.
Researcher Affiliation | Collaboration | Yuxuan Zhao (1), Qi Sun (1)*, Zhuolun He (1), Yang Bai (1,2), Bei Yu (1); (1) The Chinese University of Hong Kong, (2) SmartMore; {yxzhao21,qsun,zlhe,ybai,byu}@cse.cuhk.edu.hk
Pseudocode | Yes | The pseudocode of our DP-based method is provided in the appendix.
Open Source Code | No | The paper does not provide an explicit statement or a link to its own open-source code for the methodology described. It mentions using external tools like PyTorch and TASO's rules.
Open Datasets | Yes | Seven modern DNNs are benchmarked in the experiments, and the details of the models are shown in Table 1. Inception-v3 (Szegedy et al. 2016) and ResNet-50 (He et al. 2016) are widely used networks for image classification. ResNeXt-50 (Xie et al. 2017) introduces a new grouped convolution operator to replace the residual block and improves the model accuracy. NasNet-A (Zoph et al. 2018) and NasNet-Mobile (Zoph et al. 2018) are representative CNN models with complicated structures discovered by neural architecture search. RNNTC (Lei et al. 2017), a model for natural language processing tasks, is also tested; it is a sequence-to-sequence RNN model built on the simple recurrent unit (SRU) (Lei et al. 2017). BERT (Devlin et al. 2018), i.e., Bidirectional Encoder Representations from Transformers, is a powerful model that stacks transformers and has obtained state-of-the-art results on many tasks. (An illustrative model-instantiation sketch follows the table.)
Dataset Splits | No | The paper uses well-known DNN models for benchmarking inference performance, but it does not explicitly specify the training, validation, or test dataset splits for these models. It focuses on optimizing the inference computation graph rather than the training process of the models themselves.
Hardware Specification | Yes | We conduct all the experiments on an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz. The hardware platform is an NVIDIA GeForce RTX 2080Ti GPU with CUDA 11.0, cuDNN 8.0.5, and PyTorch 1.7.
Software Dependencies | Yes | The hardware platform is an NVIDIA GeForce RTX 2080Ti GPU with CUDA 11.0, cuDNN 8.0.5, and PyTorch 1.7. (An environment-check sketch follows the table.)
Experiment Setup | Yes | In our method, the 157 substitution rules from TASO (Jia et al. 2019a) are used as the substitution rule set R. We set α = 0.25 as the weight of the critical path cost and β = 1.1 for the backtracking search. The lower threshold Size_L and the upper threshold Size_U are set to 40 and 120, respectively. In each iteration, the top-20 candidate graphs are collected for onboard verification. (A configuration sketch follows the table.)
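
For the Open Datasets row, the CNN subset of the benchmarks can be instantiated from torchvision for a quick sanity check. This is an illustrative sketch only: the paper does not state that torchvision implementations were used, and NasNet-A/Mobile, RNNTC, and BERT are not part of torchvision and would need their own reference implementations.

```python
# Hypothetical sketch: build the torchvision CNN benchmarks in inference mode.
import torch
import torchvision.models as models

benchmarks = {
    "Inception-v3": (models.inception_v3(pretrained=False, aux_logits=True), 299),
    "ResNet-50":    (models.resnet50(pretrained=False), 224),
    "ResNeXt-50":   (models.resnext50_32x4d(pretrained=False), 224),
}

for name, (net, size) in benchmarks.items():
    net.eval()  # inference-only benchmarking, matching the paper's setting
    with torch.no_grad():
        out = net(torch.randn(1, 3, size, size))
    print(name, tuple(out.shape))
```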
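
For the Hardware Specification and Software Dependencies rows, a minimal check like the following can confirm that a local environment matches the reported stack. The snippet is an assumption rather than part of the paper; the expected values in the comments come from the quotes above.

```python
# Minimal environment check against the stack reported in the paper.
import torch

print("PyTorch:", torch.__version__)                # expected: 1.7.x
print("CUDA:", torch.version.cuda)                  # expected: 11.0
print("cuDNN:", torch.backends.cudnn.version())     # expected: 8005, i.e. cuDNN 8.0.5
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))    # expected: GeForce RTX 2080 Ti
```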
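
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration block. The field names below are ours (AutoGraph's code is not public), so this is a readability sketch, not the authors' API; only the values come from the paper.

```python
# Illustrative configuration mirroring the reported hyperparameters.
# All field names are hypothetical.
from dataclasses import dataclass

@dataclass
class AutoGraphConfig:
    num_substitution_rules: int = 157  # substitution rule set R, taken from TASO
    alpha: float = 0.25                # weight of the critical path cost
    beta: float = 1.1                  # threshold for the backtracking search
    size_lower: int = 40               # lower graph-size threshold Size_L
    size_upper: int = 120              # upper graph-size threshold Size_U
    top_k: int = 20                    # candidate graphs per iteration for onboard verification

print(AutoGraphConfig())
```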