BabelTower: Learning to Auto-parallelized Program Translation

Authors: Yuanbo Wen, Qi Guo, Qiang Fu, Xiaqing Li, Jianxing Xu, Yanlin Tang, Yongwei Zhao, Xing Hu, Zidong Du, Ling Li, Chao Wang, Xuehai Zhou, Yunji Chen

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that BabelTower outperforms the state of the art by 1.79, 6.09, and 9.39 in terms of BLEU, CodeBLEU, and the specifically designed ParaBLEU, respectively. The CUDA code generated by BabelTower attains a speedup of up to 347× over the sequential C code, and developer productivity is improved by at most 3.8×. (A minimal BLEU-scoring sketch follows the table.)
Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; 3 Cambricon Technologies, Beijing, China; 4 University of Chinese Academy of Sciences; 5 Institute of Software, Chinese Academy of Sciences.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks; it provides code examples instead.
Open Source Code | No | The paper states that a large-scale C-CUDA dataset is publicly available, but it does not provide an explicit statement or link for the open-source code of the BabelTower framework itself.
Open Datasets | Yes | We are the first to provide a publicly available large-scale C-CUDA dataset, enabling advanced research on the important domain of auto-parallelized program translation. As the basis of training, we create a large-scale dataset consisting of 501,732 C functions and 129,497 CUDA functions, as well as C-CUDA function pairs for validation and test, all of which are compute-intensive to evaluate the effectiveness of parallel semantic conversion and are mined from open-source repositories.
Dataset Splits | Yes | The Monolingual Corpora serve as the training dataset for unsupervised training, while the Paired Corpora are for validation and test. We create a large-scale dataset consisting of 501,732 C functions and 129,497 CUDA functions, as well as C-CUDA function pairs for validation and test. The benchmarks we used come from the paired corpora of the built C-to-CUDA dataset, where half are for validation and the other half for test, with 364 C-CUDA function pairs in total. (A split sketch follows the table.)
Hardware Specification | Yes | We use 32 V100 GPUs for training the pretrained model and the back-translation model, and an RTX 8000 for training the discriminative reranking model.
Software Dependencies | No | The paper mentions using a 'Transformer architecture', the 'XLM model', and the 'Adam optimizer', but does not provide specific version numbers for these or for other software dependencies such as Python, PyTorch/TensorFlow, or the CUDA toolkit.
Experiment Setup | Yes | We build all models on the Transformer architecture with 6 layers, 8 attention heads, and a 1024-dimensional embedding. For the discriminative reranking model, we use a separate classifier decoder of two MLP layers with a tanh activation function. We optimize BabelTower with the Adam optimizer at a learning rate of 0.0001, and apply a learning-rate decay schedule with 10,000 warm-up steps and a 0.01 decay factor. (A configuration sketch follows the table.)
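
For the BLEU figure quoted in the Research Type row, a standard corpus-level scorer is sufficient. The minimal sketch below uses sacrebleu on hypothetical hypothesis/reference files; the file names are illustrative, and CodeBLEU as well as the paper's ParaBLEU metric require the authors' own tooling and are not reproduced here.

```python
# Minimal sketch: corpus-level BLEU for generated CUDA code using sacrebleu.
# File names are hypothetical; CodeBLEU and ParaBLEU are not covered here.
import sacrebleu


def corpus_bleu_from_files(hyp_path: str, ref_path: str) -> float:
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]
    # sacrebleu expects a list of reference streams (one stream per reference set).
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


if __name__ == "__main__":
    print(corpus_bleu_from_files("generated_cuda.txt", "reference_cuda.txt"))
```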
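The Dataset Splits row describes monolingual C and CUDA corpora for unsupervised training plus 364 paired C-CUDA functions divided evenly into validation and test halves. Below is a minimal sketch of such a split; the JSONL layout and field names are assumptions made for illustration, not the released dataset format.

```python
# Sketch of the described split: 364 paired C-CUDA functions divided evenly
# into validation and test. The file layout and keys ("c", "cuda") are assumed.
import json
import random


def split_paired_corpus(pairs_path: str, seed: int = 0):
    with open(pairs_path, encoding="utf-8") as f:
        pairs = [json.loads(line) for line in f]  # e.g. {"c": ..., "cuda": ...}
    random.Random(seed).shuffle(pairs)
    mid = len(pairs) // 2                          # 364 pairs -> 182 / 182
    return pairs[:mid], pairs[mid:]                # validation half, test half


valid_pairs, test_pairs = split_paired_corpus("c_cuda_pairs.jsonl")
```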
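The Experiment Setup row reports a 6-layer, 8-head, 1024-dimensional Transformer, a two-MLP-layer tanh classifier head for the discriminative reranker, and Adam at a 0.0001 learning rate with 10,000 warm-up steps and a 0.01 decay factor. The PyTorch sketch below mirrors those numbers with an encoder-only stand-in; the full encoder-decoder/XLM setup and the exact decay rule are not specified in the excerpt, so those parts are assumptions.

```python
# Sketch of the reported hyperparameters in PyTorch. The encoder-only model and
# the post-warm-up decay rule are assumptions; only the numbers come from the row above.
import torch
import torch.nn as nn

EMB, HEADS, LAYERS, WARMUP, DECAY = 1024, 8, 6, 10_000, 0.01

encoder_layer = nn.TransformerEncoderLayer(d_model=EMB, nhead=HEADS, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=LAYERS)

# Two-MLP-layer classifier head with tanh, as described for the reranking model.
reranker_head = nn.Sequential(
    nn.Linear(EMB, EMB),
    nn.Tanh(),
    nn.Linear(EMB, 1),
)

params = list(encoder.parameters()) + list(reranker_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)


def lr_scale(step: int) -> float:
    # Linear warm-up for 10,000 steps, then a gradual decay floored at the
    # reported 0.01 factor (the precise schedule shape is assumed).
    if step < WARMUP:
        return (step + 1) / WARMUP
    return max(DECAY, (WARMUP / (step + 1)) ** 0.5)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```

Pairing a linear warm-up with a slow decay is the usual recipe for keeping large-embedding Transformer training stable, which is consistent with the schedule the row describes.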