Compiler Auto-Vectorization with Imitation Learning

Authors: Charith Mendis, Cambridge Yang, Yewen Pu, Saman Amarasinghe, Michael Carbin

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that the learnt policy, Vemal, produces a vectorization scheme that is better than the well-tuned heuristics used by the LLVM compiler. More specifically, the learnt agent produces a vectorization strategy that has a 22.6% higher average reduction in cost compared to the LLVM compiler when measured using its own cost model, and matches the runtime performance of the ILP-based solution in 5 out of 7 applications in the NAS benchmark suite. Evaluation of the learnt policy on representative compiler benchmark suites (SPEC2006fp C/C++ (Henning, 2006), SPEC2017fp C/C++ (Bucek et al., 2018) and the NAS benchmark suite (Bailey et al., 1991)). Specifically, we show that the learnt policy has an average static cost reduction which is 22.6% higher than LLVM's, as measured by LLVM's own static cost model. We also show that the learnt policy achieves a geometric mean runtime speedup of 1.015 on the NAS benchmark suite (held out during training) compared to LLVM, matching the runtime speedups of goSLP in 5 out of 7 applications. (The geometric-mean arithmetic is illustrated in the first sketch after this table.)
Researcher Affiliation | Academia | Charith Mendis (MIT CSAIL, charithm@mit.edu), Cambridge Yang (MIT CSAIL, camyang@csail.mit.edu), Yewen Pu (MIT CSAIL, yewenpu@mit.edu), Saman Amarasinghe (MIT CSAIL, saman@csail.mit.edu), Michael Carbin (MIT CSAIL, mcarbin@csail.mit.edu)
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the described methodology.
Open Datasets | Yes | Evaluation of the learnt policy on representative compiler benchmark suites (SPEC2006fp C/C++ (Henning, 2006), SPEC2017fp C/C++ (Bucek et al., 2018) and the NAS benchmark suite (Bailey et al., 1991)).
Dataset Splits | No | The paper states, 'Finally, we split the dataset into a training set (80%) and a test set (20%) such that the proportionality of the vectorized and non-vectorized functions remains the same for both.' It does not mention a separate validation split. (A stratified-split sketch follows the table.)
Hardware Specification | Yes | All benchmark programs are run on a Haswell Intel(R) Xeon(R) CPU E5-2680 v3 machine running at 2.50GHz with 32KB L1 and 256KB L2 cache sizes.
Software Dependencies | Yes | We use LLVM SLP (clang-6.0), goSLP and a random packing agent as our baselines for comparison.
Experiment Setup | Yes | We pre-train each network using 3000 randomly sampled batches. At the beginning of each epoch, we randomly sample 400 of the partitioned functions and augment our dataset using rollouts obtained for those functions. We use a mixed student-teacher policy similar to that used by the original DAGGER algorithm (Ross et al., 2011) to take rollouts, with the probability of choosing the teacher agent (ILP solver) exponentially decayed by β = 0.9 at the beginning of each epoch. Finally, we use goSLP's ILP solution to compute the optimal actions for each state we encounter during rollouts. We use 20 message passing iterations in our GGNN. We train the neural network using stochastic gradient descent with momentum of 0.9, an initial learning rate of 0.002 and an exponentially decaying learning rate schedule (decay of 0.95). We randomly sample 50 state-action pairs for each batch and sample replay-buffer-size/batch-size batches for each epoch. (A hedged training-loop sketch with these hyperparameters follows the table.)
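
The geometric mean speedup quoted in the Research Type row is the n-th root of the product of per-benchmark speedup ratios. A minimal sketch of that arithmetic follows; the per-benchmark numbers are hypothetical placeholders, not values from the paper:

```python
import math

# Hypothetical per-application speedups over LLVM for 7 NAS benchmarks
# (illustrative only; the paper reports a 1.015 geometric mean overall).
speedups = [1.03, 1.00, 1.02, 0.99, 1.01, 1.00, 1.06]

# Geometric mean = exp(mean of logs), i.e. the n-th root of the product.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geometric mean speedup: {geomean:.3f}")
```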
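
The stratified 80/20 split described in the Dataset Splits row can be reproduced with a stratified sampler. A sketch assuming scikit-learn and hypothetical function identifiers and vectorization labels (the paper does not name its tooling):

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: function identifiers and 0/1 labels marking
# whether each function was vectorized (names are illustrative).
functions = [f"fn_{i}" for i in range(100)]
is_vectorized = [i % 4 == 0 for i in range(100)]  # ~25% vectorized

# stratify= keeps the vectorized/non-vectorized proportion identical
# across the 80% training set and the 20% test set.
train_fns, test_fns = train_test_split(
    functions, test_size=0.2, stratify=is_vectorized, random_state=0)
print(len(train_fns), len(test_fns))  # 80 20
```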
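
The Experiment Setup row fixes several hyperparameters: a teacher probability decayed by β = 0.9 per epoch, SGD with momentum 0.9, an initial learning rate of 0.002, and a 0.95 exponential learning-rate decay. The sketch below (PyTorch, with a placeholder linear policy standing in for the GGNN and hypothetical teacher/student callables) shows how those pieces fit together; it is an assumption-laden illustration, not the authors' code:

```python
import random
import torch

BETA = 0.9            # teacher-probability decay per epoch (DAgger-style)
LR, MOMENTUM = 0.002, 0.9
LR_DECAY = 0.95       # exponential learning-rate decay per epoch
NUM_EPOCHS = 10       # illustrative; not specified in the quoted setup

policy = torch.nn.Linear(64, 8)  # placeholder for the GGNN-based policy
optimizer = torch.optim.SGD(policy.parameters(), lr=LR, momentum=MOMENTUM)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=LR_DECAY)

def pick_actor(epoch, teacher, student):
    """Mixed student-teacher rollout: with probability BETA**epoch act
    with the teacher (the ILP solver), otherwise with the student."""
    return teacher if random.random() < BETA ** epoch else student

for epoch in range(NUM_EPOCHS):
    # Per the paper: sample 400 functions, take rollouts with
    # pick_actor(epoch, ...), label visited states with goSLP's ILP
    # solution, and train on batches of 50 state-action pairs.
    scheduler.step()  # decay the learning rate once per epoch
```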