Learning to Optimize Tensor Programs

Authors: Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our framework delivers performance that is competitive with state-of-the-art hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We provide a detailed empirical analysis of component design choices in this framework. Experimental results on real-world DL workloads show that our framework yields end-to-end performance improvements ranging from 1.2× to 3.8× over existing frameworks.
Researcher Affiliation | Academia | Paul G. Allen School of Computer Science & Engineering, University of Washington; Shanghai Jiao Tong University
Pseudocode | Yes | Algorithm 1: Learning to Optimize Tensor Programs (reconstructed in full under Experiment Setup below)
Open Source Code | Yes | Our framework can be found at https://tvm.ai.
Open Datasets | Yes | Component evaluations were based on convolution workloads in ResNet-18 [14] for ImageNet classification (Table 1).
Dataset Splits | No | The paper mentions using ResNet-18 and MobileNet, which are typically evaluated on standard datasets with predefined splits, but it does not explicitly provide the specific training, validation, or test dataset splits (e.g., percentages or counts) used for its experiments.
Hardware Specification | Yes | We compared our approach to existing DL frameworks backed by highly engineered hardware-specific libraries on diverse hardware back-ends: a server-class GPU, an embedded CPU, and a mobile GPU. The baselines were: cuDNN v7 for the NVIDIA GPU, TFLite (commit 7558b085) for the Cortex-A53, and the ARM Compute Library (v18.03) for the ARM Mali GPU.
Software Dependencies | Yes | Our baselines were: MXNet (v1.1) and TensorFlow (v1.7) for the GPU, TFLite (commit 7558b085) for the Cortex-A53, and the ARM Compute Library (v18.03) for the ARM Mali GPU.
Experiment Setup | Yes | Algorithm 1 (reconstructed below) plus the transfer-learning sampling setup: We randomly picked samples from D collected from C1, C2, C3, C4, C5, C6 and used them to form the source domain (30,000 samples in the TITAN X experiment and 20,000 samples in the ARM GPU and ARM A53 experiments).

Algorithm 1: Learning to Optimize Tensor Programs
    Input : transformation space S_e
    Output: selected schedule configuration s*
    D ← ∅
    while n_trials < max_n_trials do
        // Pick the next promising batch
        Q ← run parallel simulated annealing to collect candidates in S_e, using energy function f̂
        S ← run greedy submodular optimization to pick a (1 − ε)b-subset from Q by maximizing Equation 3
        S ← S ∪ { εb randomly sampled candidates }
        // Run measurement on hardware environment
        for s in S do
            c ← f(g(e, s));  D ← D ∪ {(e, s, c)}
        end
        // Update cost model
        update f̂ using D
        n_trials ← n_trials + b
    end
    s* ← history-best schedule configuration
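For a concrete sense of how the loop in Algorithm 1 fits together, the following is a minimal, runnable Python sketch. Everything in it is a toy stand-in rather than the paper's implementation: true_cost replaces a real hardware measurement f(g(e, s)), CostModel replaces the statistical cost model f̂ (the paper uses GBT or TreeGRU models), and the diversity-aware submodular selection of Equation 3 is simplified here to a plain top-k by predicted cost, with εb random candidates kept for exploration.

import random

# Toy stand-ins; this illustrates the loop of Algorithm 1, not the AutoTVM
# implementation. Schedule configurations are plain integers.

def true_cost(s):
    # Pretend hardware measurement f(g(e, s)): lower is better, with noise.
    return (s - 37) ** 2 + random.gauss(0.0, 1.0)

class CostModel:
    # Trivial learned cost model f_hat: predicts the cost of the nearest
    # already-measured configuration (stand-in for the paper's GBT/TreeGRU).
    def __init__(self):
        self.data = []
    def fit(self, data):
        self.data = list(data)
    def predict(self, s):
        if not self.data:
            return 0.0
        return min(self.data, key=lambda p: abs(p[0] - s))[1]

def simulated_annealing(space, energy, steps=100, temp=5.0):
    # Random-walk annealing over the (integer) space, guided by f_hat.
    s = random.choice(space)
    visited = {s}
    for _ in range(steps):
        t = min(max(s + random.choice([-3, -2, -1, 1, 2, 3]), space[0]), space[-1])
        if energy(t) < energy(s) or random.random() < temp / (temp + 1.0):
            s = t
        visited.add(s)
        temp *= 0.95
    return list(visited)

def learn_to_optimize(space, b=8, max_n_trials=64, eps=0.1):
    data, model, n_trials = [], CostModel(), 0
    while n_trials < max_n_trials:
        # Pick the next promising batch, using f_hat as the energy function.
        candidates = sorted(simulated_annealing(space, model.predict), key=model.predict)
        batch = candidates[: int((1 - eps) * b)]       # exploit: best under f_hat
        batch += random.sample(space, b - len(batch))  # explore: eps*b random picks
        for s in batch:                                # "measure on hardware"
            data.append((s, true_cost(s)))
        model.fit(data)                                # update the cost model
        n_trials += b
    return min(data, key=lambda p: p[1])[0]            # history-best configuration

print(learn_to_optimize(space=list(range(128))))

In the released framework (https://tvm.ai), this exploration loop corresponds to the AutoTVM tuning module, which performs the measurements on real devices rather than with a synthetic cost function.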