AdaTune: Adaptive Tensor Program Compilation Made Efficient

Authors: Menghao Li, Minjia Zhang, Chi Wang, Mingqin Li

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate and compare the levels of optimization obtained by AutoTVM, a state-of-the-art Learning-to-Compile technique on top of TVM, and AdaTune. The experiment results show that AdaTune obtains up to 115% higher GFLOPS than the baseline under the same optimization time budget. Furthermore, AdaTune provides 1.3x-3.9x speedup in optimization time over the baseline to reach the same optimization quality for a range of models across different hardware architectures.
Researcher Affiliation | Industry | Menghao Li, Minjia Zhang, Chi Wang, Mingqin Li, Microsoft Corporation, {t-meli,minjiaz,wang.chi,mingqli}@microsoft.com
Pseudocode | Yes | Algorithm 1 AdaTune
1: Input: Transformation space S_e
2: Output: Selected transformation plan p*
3: D ← {}
4: while n_iterations < max_n_iterations do
5:     Q ← run contextual simulated annealing to collect candidates in S_e using the surrogate model f̂ and EI in Section 4.2.1    ▷ Finding the next promising batch
6:     Randomly sample K plans p_1, p_2, ..., p_K from S_e
7:     ε_t ← (1/K) · Σ_{k=1}^{K} standard_deviation(f̂(p_k)) / Perf(p*)
8:     S ← pick (1 − ε_t)·b subset from Q
9:     S ← S ∪ {Randomly sample ε_t·b candidates}
10:    for p in S do
11:        for i in (1, ..., B) do    ▷ Measure the hardware cost with AE
12:            cv ← std({Perf(p)_1, Perf(p)_2, ..., Perf(p)_i}) / avg({Perf(p)_1, Perf(p)_2, ..., Perf(p)_i})
13:            if cv < threshold then
14:                break
15:        D ← D ∪ {(p, Perf(p))}
16:    update f̂ using D    ▷ Update the model given new measurements
17:    n_iterations ← n_iterations + b
18: p* ← best found transformation plan
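To make the control flow above concrete, here is a minimal, self-contained Python sketch of Algorithm 1; it is not the authors' released implementation. The toy plan encoding, the synthetic measure_once cost model, and the plain random candidate pool standing in for contextual simulated annealing with EI are illustrative assumptions; only the uncertainty-driven exploration rate ε_t, the ε-greedy batch selection, and the coefficient-of-variation early stop mirror the pseudocode.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
DIM, B, b, K = 8, 50, 16, 64          # knob dims, per-plan measurement cap B, batch size b, probe count K
MAX_ITERS, CV_THRESHOLD = 256, 0.05

def sample_plans(n):
    # Draw n random transformation plans from a toy space S_e (integer knob vectors).
    return rng.integers(0, 10, size=(n, DIM)).astype(float)

def measure_once(plan):
    # Stand-in for a single hardware measurement Perf(p) (higher is better), with noise.
    return float(100.0 * np.exp(-np.sum((plan - 5.0) ** 2) / 100.0) + rng.normal(scale=0.5))

def surrogate_std(model, plans):
    # Per-tree spread of the random forest, used here as an uncertainty proxy.
    per_tree = np.stack([t.predict(plans) for t in model.estimators_])
    return per_tree.std(axis=0)

D_x, D_y = [], []                      # measured (plan, Perf) pairs
best_plan, best_perf = None, -np.inf
model, n_iterations = None, 0

while n_iterations < MAX_ITERS:
    Q = sample_plans(512)              # candidate pool (SA + EI search in the paper)
    if model is None:                  # bootstrap: measure a random batch first
        S = Q[:b]
    else:
        # Uncertainty-driven exploration rate eps_t (Algorithm 1, line 7).
        probes = sample_plans(K)
        eps_t = float(np.clip(surrogate_std(model, probes).mean() / max(best_perf, 1e-6), 0.0, 1.0))
        order = np.argsort(-model.predict(Q))            # exploit: top plans by predicted Perf
        n_exploit = int(round((1 - eps_t) * b))
        S = np.vstack([Q[order[:n_exploit]], sample_plans(b - n_exploit)])  # explore the rest
    for plan in S:                     # adaptive evaluator: stop early once CV is small
        runs = []
        for _ in range(B):
            runs.append(measure_once(plan))
            if len(runs) > 1 and np.std(runs) / max(abs(np.mean(runs)), 1e-9) < CV_THRESHOLD:
                break
        perf = float(np.mean(runs))
        D_x.append(plan)
        D_y.append(perf)
        if perf > best_perf:
            best_perf, best_plan = perf, plan
    # Refit the surrogate on all measurements collected so far.
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(np.array(D_x), np.array(D_y))
    n_iterations += b

print("best plan:", best_plan, "estimated Perf (synthetic):", round(best_perf, 3))

The per-tree spread of the random forest is used here only as a stand-in for the jackknife variance estimate that the paper obtains from forestci.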
Open Source Code | No | We conduct ablation analysis to study the effects of the proposed techniques, and we will make the source code publicly accessible to encourage further research.
Open Datasets | Yes | We include four tasks: one convolutional layer sampled from ResNet-18 [20] and one batched GEMM operator from Transformer [41] on both CPU (Intel Xeon CPU E5-2690 v3 @ 2.60GHz 2600 MHz) and GPUs (Nvidia Tesla P100). We compare the end-to-end optimization results on ResNet-18 [20], VGG16 [39], and SqueezeNet V1 [24].
Dataset Splits | No | The paper refers to using established models (ResNet-18, VGG16, SqueezeNet) but does not specify the training, validation, or test dataset splits used for their experiments.
Hardware Specification | Yes | We include four tasks: one convolutional layer sampled from ResNet-18 [20] and one batched GEMM operator from Transformer [41] on both CPU (Intel Xeon CPU E5-2690 v3 @ 2.60GHz 2600 MHz) and GPUs (Nvidia Tesla P100).
Software Dependencies | No | We implement AdaTune in Python, and we leverage scikit-learn [33] and forestci [1] to implement the surrogate model and optimizer.
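Although exact library versions are not reported, the sketch below shows one way a random-forest surrogate with predictive uncertainty could be wired up with scikit-learn and forestci. The plan encodings and the synthetic Perf target are made up for illustration, and the forestci call follows the signature used in its documentation examples, which may differ between library versions.

import numpy as np
import forestci as fci
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy plan encodings and a synthetic Perf(p) target, for illustration only.
X_train = rng.integers(0, 10, size=(200, 8)).astype(float)
y_train = 100.0 * np.exp(-((X_train - 5.0) ** 2).sum(axis=1) / 100.0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

X_new = rng.integers(0, 10, size=(16, 8)).astype(float)
mean_pred = forest.predict(X_new)
# Per-prediction variance via forestci's infinitesimal-jackknife estimator;
# the (forest, X_train, X_test) signature follows the forestci docs example.
var_pred = fci.random_forest_error(forest, X_train, X_new)
print(mean_pred[:3], np.sqrt(np.maximum(var_pred[:3], 0.0)))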
Experiment Setup | Yes | We use n=500 for all experiments and set micro-batch size B = 50 in AdaTune. We use the default settings for other hyperparameters provided by AutoTVM. The detailed parameter settings are included in Appendix A.
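For quick reference, the two explicitly reported settings could be captured in a small config mapping such as the hypothetical one below; the key names are illustrative, and every other hyperparameter falls back to AutoTVM's defaults as stated in the row above.

# Hypothetical config holding only the values reported in the paper.
ADATUNE_REPORTED_SETTINGS = {
    "n": 500,   # "We use n=500 for all experiments"
    "B": 50,    # micro-batch size B used by the adaptive evaluator
}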