DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation

Authors: Minjia Zhang, Menghao Li, Chi Wang, Mingqin Li

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate and compare DynaTune with the state-of-the-art DL compiler. The experiment results show that DynaTune is 1.2-2.4 times faster to achieve the same optimization quality for a range of models across different hardware architectures.
Researcher Affiliation | Industry | Minjia Zhang, Menghao Li*, Chi Wang & Mingqin Li, Microsoft Corporation, {minjiaz,t-meli,wang.chi,mingqli}@microsoft.com
Pseudocode | Yes | Algorithm 1 DynaTune: Dynamic Multi-Tensor-Operator Optimization (a hedged sketch of this kind of scheduling loop appears after the table)
Open Source Code | No | The paper does not explicitly provide a link to the source code for DynaTune or state that it is open-sourced or available.
Open Datasets | Yes | We include four tasks, covering both CPU and GPU hardware: ResNet-18 (He et al., 2016) and SqueezeNet (Iandola et al., 2016) on CPU... VGG (Simonyan & Zisserman, 2015) and Transformer Encoder (Vaswani et al., 2017) on GPUs...
Dataset Splits | No | The paper refers to 'train', 'validation', and 'test' in the context of a general compilation pipeline (Fig. 1) but does not provide specific train/validation/test dataset splits (e.g., percentages or counts) for the models evaluated in their experiments.
Hardware Specification | Yes | ResNet-18 (He et al., 2016) and SqueezeNet (Iandola et al., 2016) on CPU (Intel Xeon CPU E5-2690 v3 @ 2.60GHz), VGG (Simonyan & Zisserman, 2015) and Transformer Encoder (Vaswani et al., 2017) on GPUs (Nvidia Tesla P100)
Software Dependencies | No | The paper mentions 'AutoTVM', 'Python', and 'emcee' for implementation, but it does not specify concrete version numbers for these software components (e.g., Python 3.x, emcee vX.Y).
Experiment Setup | Yes | We use the default hyperparameters provided by AutoTVM for the underlying code optimization. To obtain the parameter posterior, we run the ensemble MCMC with 10 walkers and 500 sampling steps. For UCB, we choose a default value of C = 2 suggested by the theory in Auer et al. (2002), which we find to be robust to different ranges of latencies. When the initial latency is <1ms, we empirically find that C = 0.2 leads to increased performance, which we report.
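
The pseudocode row above names Algorithm 1 (DynaTune: Dynamic Multi-Tensor-Operator Optimization) but does not reproduce it. The Python sketch below is only an illustration of the kind of UCB-driven time-slice allocation such a multi-tensor-operator scheduler could use, with the exploration constant C from the experiment setup; the function names (ucb_score, allocate_time_slices, tune_one_slice) and the latency-reduction reward are assumptions, not the authors' code.

# Hypothetical illustration (not the authors' code) of a UCB-driven time-slice
# scheduler in the spirit of "Dynamic Multi-Tensor-Operator Optimization":
# each tensor operator is treated as an arm, and each tuning time slice is
# given to the operator with the highest upper-confidence-bound score.
import math

C = 2.0  # exploration constant; the paper reports C = 2 (C = 0.2 when initial latency < 1 ms)

def ucb_score(mean_reward, times_selected, total_selections, c=C):
    """UCB1-style score: estimated reward plus an exploration bonus (Auer et al., 2002)."""
    if times_selected == 0:
        return float("inf")  # ensure every operator gets at least one slice
    bonus = c * math.sqrt(math.log(total_selections) / times_selected)
    return mean_reward + bonus

def allocate_time_slices(operators, num_slices, tune_one_slice):
    """operators: list of operator names; tune_one_slice(op) returns the observed
    latency reduction (the 'reward') from tuning op for one time slice."""
    counts = {op: 0 for op in operators}
    total_reward = {op: 0.0 for op in operators}
    for t in range(1, num_slices + 1):
        scores = {
            op: ucb_score(total_reward[op] / counts[op] if counts[op] else 0.0,
                          counts[op], t)
            for op in operators
        }
        chosen = max(scores, key=scores.get)
        counts[chosen] += 1
        total_reward[chosen] += tune_one_slice(chosen)  # e.g., run the underlying tuner for one slice
    return counts, total_reward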
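
The experiment setup also reports an ensemble MCMC with 10 walkers and 500 sampling steps, and the software-dependency row names emcee. Below is a minimal sketch of that configuration, assuming emcee 3.x, a placeholder standard-normal log-probability in place of the paper's actual belief model, and an assumed parameter count (ndim = 2) chosen purely for illustration.

# A minimal sketch, assuming emcee 3.x, of the reported posterior-sampling
# configuration (10 walkers, 500 sampling steps). The log-probability is a
# placeholder standard normal; the paper's actual model of the latency-curve
# parameters is not shown in this summary.
import numpy as np
import emcee

def log_prob(theta):
    # placeholder log-density; replace with the belief model of interest
    return -0.5 * np.sum(theta ** 2)

ndim = 2        # number of model parameters (assumed for illustration)
nwalkers = 10   # "10 walkers" from the experiment setup
nsteps = 500    # "500 sampling steps" from the experiment setup

initial = np.random.randn(nwalkers, ndim)            # starting positions of the walkers
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(initial, nsteps)
posterior_samples = sampler.get_chain(flat=True)     # (nwalkers * nsteps, ndim) array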