Transferable Graph Optimizers for ML Compilers
Authors: Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter Ma, Qiumin Xu, Hanxiao Liu, Phitchaya Mangpo Phothilimthana, Shen Wang, Anna Goldie, Azalia Mirhoseini, James Laudon
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves on average 21% improvement over human experts and 18% improvement over the prior state of the art, with 15× faster convergence, on a device placement task evaluated in real systems. |
| Researcher Affiliation | Collaboration | (1) Google, Mountain View, CA, USA ({yanqiz, sudipr, pcma, qiuminxu, hanxiaol, mangpo, shenwang, agoldie, azalia, jlaudon}@google.com); (2) UC Riverside, Riverside, CA, USA (abdolrashidi@gmail.com); (3) Carnegie Mellon University, Pittsburgh, PA, USA (wonglkd@gmail.com) |
| Pseudocode | No | The paper describes network architectures and computational procedures using mathematical equations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Workloads: We evaluate GO using the computational graphs of six diverse architectures from different domains. Specifically, we use LSTM-based RNN Language Model [35, 15], GNMT [29], and Transformer-XL [8] from the language domain; Inception-v3 [30] and AmoebaNet [25] from computer vision; and finally WaveNet [31] from the speech domain. |
| Dataset Splits | Yes | Inspired by the pre-training and fine-tuning method, we pretrain GO over all but one workload. We randomly sample from this set of input graphs to construct a batch. We train GO for 1000 steps for each batch before switching to the next batch. We then fine-tune the pre-trained model on the hold-out graphs (i.e., graphs from the sixth workload not included in the training set) for fewer than 50 steps, which takes less than one minute. (A sketch of this pre-train/fine-tune schedule follows the table.) |
| Hardware Specification | Yes | For the placement task, where TensorFlow provides an API for device assignment, our experiments are evaluated on actual hardware with a configuration of one Intel Broadwell CPU and up to eight Nvidia P100 GPUs. For the fusion and scheduling tasks, where an API for setting node priorities is not available in TensorFlow, we instead use an analytical performance model based on roofline estimates (details in Supp. Mat. A.3) for V100 GPUs. (A roofline-style estimate is sketched after the table.) |
| Software Dependencies | No | The paper mentions that 'All our workloads are implemented in TensorFlow' and 'We adopted a Proximal Policy Optimization (PPO) [27] algorithm', but it does not specify version numbers for these software components. |
| Experiment Setup | No | The paper states: 'We find a set of optimized hyper parameters and keep them fixed for all the experiments presented. The optimal found PPO hyper parameters are presented in Supp. Mat. A.1.' This indicates the details are in supplementary material, not directly in the main text. |
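
The Dataset Splits row quotes GO's pre-train/fine-tune schedule in prose. The following is a minimal sketch of that schedule, assuming a hypothetical policy object and a placeholder `ppo_update` step; the outer-loop batch count and batch size are assumptions, while the 1000 steps per batch and the fewer-than-50 fine-tuning steps come from the paper.

```python
# Minimal sketch of the pre-train / fine-tune schedule quoted above.
# `ppo_update`, the graph representation, and the outer-loop/batch sizes are
# illustrative assumptions, not the authors' implementation.
import random

def ppo_update(policy, graph_batch):
    """Placeholder for one PPO policy-gradient step on a batch of graphs."""
    pass

def pretrain_then_finetune(policy, graphs_by_workload, holdout,
                           num_pretrain_batches=20, batch_size=4):
    # Pre-train on every workload except the held-out one.
    train_graphs = [g for workload, graphs in graphs_by_workload.items()
                    if workload != holdout for g in graphs]
    for _ in range(num_pretrain_batches):          # assumed outer-loop length
        batch = random.sample(train_graphs, k=min(batch_size, len(train_graphs)))
        for _ in range(1000):                      # 1000 steps per batch (from the paper)
            ppo_update(policy, batch)
    # Fine-tune on the unseen workload for fewer than 50 steps (< 1 minute).
    for _ in range(50):
        ppo_update(policy, graphs_by_workload[holdout])
    return policy
```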
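
The Hardware Specification row mentions an analytical performance model based on roofline estimates for V100 GPUs (details in the paper's Supp. Mat. A.3). Below is a minimal sketch of a roofline-style per-op cost estimate; the peak throughput and bandwidth figures and the example matmul accounting are assumptions, not the paper's actual model.

```python
# Roofline-style runtime estimate for a single op on a V100-class GPU.
# Peak numbers and the byte/FLOP accounting below are assumptions.
PEAK_FLOPS = 15.7e12      # ~15.7 TFLOP/s FP32 (assumed V100 peak)
PEAK_BANDWIDTH = 900e9    # ~900 GB/s HBM2 (assumed V100 peak)

def roofline_time(flop_count, bytes_moved):
    """Runtime is bounded by whichever of compute or memory traffic dominates."""
    compute_time = flop_count / PEAK_FLOPS
    memory_time = bytes_moved / PEAK_BANDWIDTH
    return max(compute_time, memory_time)

# Example: a 1024 x 1024 x 1024 FP32 matmul.
flops = 2 * 1024 ** 3                 # 2 * M * N * K multiply-adds
bytes_moved = 3 * 1024 * 1024 * 4     # read A, read B, write C (FP32)
print(f"estimated runtime: {roofline_time(flops, bytes_moved) * 1e6:.1f} us")
```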