Learning to Optimize Tensor Programs

Authors: Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our framework delivers performance that is competitive with state-of-the-art hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We provide a detailed empirical analysis of component design choices in this framework. Experimental results on real-world DL workloads show that our framework yields end-to-end performance improvements ranging from 1.2× to 3.8× over existing frameworks.
Researcher Affiliation | Academia | Paul G. Allen School of Computer Science & Engineering, University of Washington; Shanghai Jiao Tong University
Pseudocode | Yes | Algorithm 1: Learning to Optimize Tensor Programs (reconstructed in full under Experiment Setup below)
Open Source Code | Yes | Our framework can be found at https://tvm.ai.
Open Datasets | Yes | Component evaluations were based on convolution workloads in ResNet-18 [14] for ImageNet classification (Table 1).
Dataset Splits | No | The paper mentions using ResNet-18 and MobileNet, which are typically evaluated on standard datasets with predefined splits, but it does not explicitly provide the specific training, validation, or test dataset splits (e.g., percentages or counts) used for its experiments.
Hardware Specification | Yes | We compared our approach to existing DL frameworks backed by highly engineered hardware-specific libraries on diverse hardware back-ends: a server-class GPU, an embedded CPU, and a mobile GPU. The baselines were: cuDNN v7 for the NVIDIA GPU, TFLite (commit 7558b085) for the Cortex-A53, and the ARM Compute Library (v18.03) for the ARM Mali GPU.
Software Dependencies | Yes | Our baselines were: MXNet (v1.1) and TensorFlow (v1.7) for the GPU, TFLite (commit 7558b085) for the Cortex-A53, and the ARM Compute Library (v18.03) for the ARM Mali GPU.
Experiment Setup | Yes | Algorithm 1 (reconstructed below) plus the transfer-learning sampling setup: We randomly picked samples from D collected from C1, C2, C3, C4, C5, C6 and used them to form the source domain (30,000 samples in the TITAN X experiment and 20,000 samples in the ARM GPU and ARM A53 experiments).

Algorithm 1: Learning to Optimize Tensor Programs
    Input : transformation space S_e
    Output: selected schedule configuration s*
    D ← ∅
    while n_trials < max_n_trials do
        // Pick the next promising batch
        Q ← run parallel simulated annealing to collect candidates in S_e, using energy function f̂
        S ← run greedy submodular optimization to pick a (1 − ε)b-subset from Q by maximizing Equation 3
        S ← S ∪ { εb randomly sampled candidates }
        // Run measurement on hardware environment
        for s in S do
            c ← f(g(e, s));  D ← D ∪ {(e, s, c)}
        end
        // Update cost model
        update f̂ using D
        n_trials ← n_trials + b
    end
    s* ← history-best schedule configuration
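For a concrete sense of how the loop in Algorithm 1 fits together, the following is a minimal, runnable Python sketch. Everything in it is a toy stand-in rather than the paper's implementation: true_cost replaces a real hardware measurement f(g(e, s)), CostModel replaces the statistical cost model f̂ (the paper uses GBT or TreeGRU models), and the diversity-aware submodular selection of Equation 3 is simplified here to a plain top-k by predicted cost, with εb random candidates kept for exploration.

import random

# Toy stand-ins; this illustrates the loop of Algorithm 1, not the AutoTVM
# implementation. Schedule configurations are plain integers.

def true_cost(s):
    # Pretend hardware measurement f(g(e, s)): lower is better, with noise.
    return (s - 37) ** 2 + random.gauss(0.0, 1.0)

class CostModel:
    # Trivial learned cost model f_hat: predicts the cost of the nearest
    # already-measured configuration (stand-in for the paper's GBT/TreeGRU).
    def __init__(self):
        self.data = []
    def fit(self, data):
        self.data = list(data)
    def predict(self, s):
        if not self.data:
            return 0.0
        return min(self.data, key=lambda p: abs(p[0] - s))[1]

def simulated_annealing(space, energy, steps=100, temp=5.0):
    # Random-walk annealing over the (integer) space, guided by f_hat.
    s = random.choice(space)
    visited = {s}
    for _ in range(steps):
        t = min(max(s + random.choice([-3, -2, -1, 1, 2, 3]), space[0]), space[-1])
        if energy(t) < energy(s) or random.random() < temp / (temp + 1.0):
            s = t
        visited.add(s)
        temp *= 0.95
    return list(visited)

def learn_to_optimize(space, b=8, max_n_trials=64, eps=0.1):
    data, model, n_trials = [], CostModel(), 0
    while n_trials < max_n_trials:
        # Pick the next promising batch, using f_hat as the energy function.
        candidates = sorted(simulated_annealing(space, model.predict), key=model.predict)
        batch = candidates[: int((1 - eps) * b)]       # exploit: best under f_hat
        batch += random.sample(space, b - len(batch))  # explore: eps*b random picks
        for s in batch:                                # "measure on hardware"
            data.append((s, true_cost(s)))
        model.fit(data)                                # update the cost model
        n_trials += b
    return min(data, key=lambda p: p[1])[0]            # history-best configuration

print(learn_to_optimize(space=list(range(128))))

In the released framework (https://tvm.ai), this exploration loop corresponds to the AutoTVM tuning module, which performs the measurements on real devices rather than with a synthetic cost function.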