Transferable Graph Optimizers for ML Compilers
Authors: Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter Ma, Qiumin Xu, Hanxiao Liu, Phitchaya Mangpo Phothilimthana, Shen Wang, Anna Goldie, Azalia Mirhoseini, James Laudon
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves on average 21% improvement over human experts and 18% improvement over the prior state of the art, with 15× faster convergence, on a device placement task evaluated in real systems. |
| Researcher Affiliation | Collaboration | (1) Google, Mountain View, CA, USA ({yanqiz, sudipr, pcma, qiuminxu, hanxiaol, mangpo, shenwang, agoldie, azalia, jlaudon}@google.com); (2) UC Riverside, Riverside, CA, USA (abdolrashidi@gmail.com); (3) Carnegie Mellon University, Pittsburgh, PA, USA (wonglkd@gmail.com) |
| Pseudocode | No | The paper describes network architectures and computational procedures using mathematical equations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Workloads: We evaluate GO using the computational graphs of six diverse architectures from different domains. Specifically, we use LSTM-based RNN Language Model [35, 15], GNMT [29], and Transformer-XL [8] from the language domain; Inception-v3 [30] and AmoebaNet [25] from computer vision; and finally WaveNet [31] from the speech domain. |
| Dataset Splits | Yes | Inspired by the pre-training and fine-tuning method, we pretrain GO over all but one workload. We randomly sample from this set of input graphs to construct a batch. We train GO for 1000 steps for each batch before switching to the next batch. We then fine-tune the pre-trained model on the hold-out graphs (i.e., graphs from the sixth workload not included in the training set) for fewer than 50 steps, which takes less than one minute. (A sketch of this pre-train/fine-tune schedule follows the table.) |
| Hardware Specification | Yes | For the placement task, where TensorFlow provides an API for device assignment, our experiments are evaluated on actual hardware with a configuration of one Intel Broadwell CPU and up to eight Nvidia P100 GPUs. For the fusion and scheduling tasks, where an API for setting node priorities is not available in TensorFlow, we instead use an analytical performance model based on roofline estimates (details in Supp. Mat. A.3) for V100 GPUs. (A roofline-style estimate is sketched after the table.) |
| Software Dependencies | No | The paper mentions that 'All our workloads are implemented in TensorFlow' and 'We adopted a Proximal Policy Optimization (PPO) [27] algorithm', but it does not specify version numbers for these software components. |
| Experiment Setup | No | The paper states: 'We find a set of optimized hyper parameters and keep them fixed for all the experiments presented. The optimal found PPO hyper parameters are presented in Supp. Mat. A.1.' This indicates the details are in supplementary material, not directly in the main text. |
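
The Dataset Splits row quotes GO's pre-train/fine-tune schedule in prose. The following is a minimal sketch of that schedule, assuming a hypothetical policy object and a placeholder `ppo_update` step; the outer-loop batch count and batch size are assumptions, while the 1000 steps per batch and the fewer-than-50 fine-tuning steps come from the paper.

```python
# Minimal sketch of the pre-train / fine-tune schedule quoted above.
# `ppo_update`, the graph representation, and the outer-loop/batch sizes are
# illustrative assumptions, not the authors' implementation.
import random

def ppo_update(policy, graph_batch):
    """Placeholder for one PPO policy-gradient step on a batch of graphs."""
    pass

def pretrain_then_finetune(policy, graphs_by_workload, holdout,
                           num_pretrain_batches=20, batch_size=4):
    # Pre-train on every workload except the held-out one.
    train_graphs = [g for workload, graphs in graphs_by_workload.items()
                    if workload != holdout for g in graphs]
    for _ in range(num_pretrain_batches):          # assumed outer-loop length
        batch = random.sample(train_graphs, k=min(batch_size, len(train_graphs)))
        for _ in range(1000):                      # 1000 steps per batch (from the paper)
            ppo_update(policy, batch)
    # Fine-tune on the unseen workload for fewer than 50 steps (< 1 minute).
    for _ in range(50):
        ppo_update(policy, graphs_by_workload[holdout])
    return policy
```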
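
The Hardware Specification row mentions an analytical performance model based on roofline estimates for V100 GPUs (details in the paper's Supp. Mat. A.3). Below is a minimal sketch of a roofline-style per-op cost estimate; the peak throughput and bandwidth figures and the example matmul accounting are assumptions, not the paper's actual model.

```python
# Roofline-style runtime estimate for a single op on a V100-class GPU.
# Peak numbers and the byte/FLOP accounting below are assumptions.
PEAK_FLOPS = 15.7e12      # ~15.7 TFLOP/s FP32 (assumed V100 peak)
PEAK_BANDWIDTH = 900e9    # ~900 GB/s HBM2 (assumed V100 peak)

def roofline_time(flop_count, bytes_moved):
    """Runtime is bounded by whichever of compute or memory traffic dominates."""
    compute_time = flop_count / PEAK_FLOPS
    memory_time = bytes_moved / PEAK_BANDWIDTH
    return max(compute_time, memory_time)

# Example: a 1024 x 1024 x 1024 FP32 matmul.
flops = 2 * 1024 ** 3                 # 2 * M * N * K multiply-adds
bytes_moved = 3 * 1024 * 1024 * 4     # read A, read B, write C (FP32)
print(f"estimated runtime: {roofline_time(flops, bytes_moved) * 1e6:.1f} us")
```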