CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming

Authors: Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, Ali Jannesari

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CodeRosetta is evaluated on C++-to-CUDA and Fortran-to-C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++-to-CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%.
Researcher Affiliation | Collaboration | Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Ali Jannesari (Iowa State University, Ames, Iowa, USA; {tehrani, arbhatt9, lechen, jannesari}@iastate.edu); Nesreen K. Ahmed (Cisco Outshift, San Jose, CA, USA; nesahmed@cisco.com); Amir Yazdanbakhsh (Google DeepMind, Mountain View, CA, USA; ayazdan@google.com)
Pseudocode | No | The paper uses figures to illustrate processes (e.g., Masked Language Modeling, AST Entity Recognition, Denoising Auto-Encoding, Back Translation) but does not present structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://coderosetta.com
Open Datasets | Yes | For the C++ to CUDA translation task, we use the dataset from BabelTower [46]... We extract the C++ and Fortran subsets from The Stack v2 dataset [25]... For fine-tuning, we use the small paired C++-Fortran dataset introduced by Bin et al. [19].
Dataset Splits | Yes | Paired validation and test sets: The validation set consists of 184 pairs, and the test set has 180 pairs of C++ and CUDA source code files. For fine-tuning, we use the small paired C++-Fortran dataset introduced by Bin et al. [19]. This set is also used for validation.
Hardware Specification | Yes | The experiments were run on a single node with four NVIDIA A100 SXM4 GPUs, each with 80 GB of memory.
Software Dependencies | Yes | We implement CodeRosetta using the Hugging Face Transformers library v4.40.1 [47].
Experiment Setup | Yes | The model is a 12-layer encoder-decoder transformer, with each layer having 12 attention heads and a hidden dimension of 1,536... The training was conducted using the AdamW optimizer [24] and a batch size of 16, using gradient accumulation over two steps. For Masked Language Modeling (MLM) training, we use a learning rate of 8 × 10⁻⁵ and train for 100 epochs with 15% masking... For Abstract Syntax Tree (AST) entity recognition, we use a learning rate of 5 × 10⁻⁶ and train for ten epochs... For Denoising Auto-Encoding and Back Translation, we use a learning rate of 5 × 10⁻⁵ and train for 20 epochs. For Denoising Auto-Encoding, we set the masking to 15%, token dropping to 25%, and token insertion to 15%, with a denoising ratio increasing by 2.5% per epoch. Finally, for fine-tuning, we use a learning rate of 5 × 10⁻⁵ for ten epochs. (Illustrative sketches of these settings follow the table below.)
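
The reported training schedule can be summarized in a small configuration sketch. This is a minimal Python illustration of the hyperparameters quoted above, not the authors' released code; the phase names, `PhaseConfig`, and `make_optimizer` are hypothetical, and the layer/head/hidden-size constants simply echo the reported architecture.

```python
from dataclasses import dataclass

import torch

# Reported architecture: 12-layer encoder-decoder transformer,
# 12 attention heads, hidden dimension 1,536.
NUM_LAYERS = 12
NUM_HEADS = 12
HIDDEN_DIM = 1536

BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 2  # gradient accumulation over two steps, as reported


@dataclass
class PhaseConfig:
    """Hyperparameters for one training phase (structure is illustrative)."""
    learning_rate: float
    epochs: int


# Values taken from the reported experiment setup; the phase keys are hypothetical.
PHASES = {
    "mlm": PhaseConfig(learning_rate=8e-5, epochs=100),                 # 15% masking
    "ast_entity_recognition": PhaseConfig(learning_rate=5e-6, epochs=10),
    "dae_and_back_translation": PhaseConfig(learning_rate=5e-5, epochs=20),
    "fine_tuning": PhaseConfig(learning_rate=5e-5, epochs=10),
}


def make_optimizer(model: torch.nn.Module, phase: str) -> torch.optim.Optimizer:
    """AdamW optimizer with the learning rate reported for the given phase."""
    return torch.optim.AdamW(model.parameters(), lr=PHASES[phase].learning_rate)
```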
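The Denoising Auto-Encoding noise settings (15% masking, 25% token dropping, 15% token insertion, with the ratio growing by 2.5% per epoch) can likewise be sketched as a token-level corruption function. This is an assumption-laden sketch rather than the paper's implementation: the `corrupt` helper and `MASK_TOKEN` placeholder are hypothetical, and reading "increasing by 2.5% per epoch" as a multiplicative scale on all three probabilities is an interpretation, not something the paper confirms.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder; the real mask token depends on the tokenizer


def corrupt(tokens, vocab, epoch, mask_p=0.15, drop_p=0.25, insert_p=0.15, ramp=0.025):
    """Apply DAE-style noise: masking, token dropping, and token insertion.

    Assumed reading of the schedule: every noise probability is scaled by
    (1 + 2.5% * epoch), so the corruption grows gradually over training.
    """
    scale = 1.0 + ramp * epoch
    mask_p, drop_p, insert_p = mask_p * scale, drop_p * scale, insert_p * scale

    noisy = []
    for tok in tokens:
        r = random.random()
        if r < drop_p:
            continue                      # drop the token entirely
        elif r < drop_p + mask_p:
            noisy.append(MASK_TOKEN)      # replace the token with a mask
        else:
            noisy.append(tok)             # keep the token unchanged
        if random.random() < insert_p:
            noisy.append(random.choice(vocab))  # insert a random vocabulary token
    return noisy
```

In a back-translation loop of this kind, the corrupted sequence would be fed to the encoder while the original tokens serve as the reconstruction target; the exact pairing used by CodeRosetta is described in the paper itself.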