CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming
Authors: Ali Tehrani, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, Ali Jannesari
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. |
| Researcher Affiliation | Collaboration | Ali Tehrani Jamsaz, Arijit Bhattacharjee, Le Chen, Ali Jannesari (Iowa State University, Ames, Iowa, USA; {tehrani, arbhatt9, lechen, jannesari}@iastate.edu); Nesreen K. Ahmed (Cisco Outshift, San Jose, CA, USA; nesahmed@cisco.com); Amir Yazdanbakhsh (Google DeepMind, Mountain View, CA, USA; ayazdan@google.com) |
| Pseudocode | No | The paper uses figures to illustrate its training objectives (e.g., Masked Language Modeling, AST Entity Recognition, Denoising Auto-Encoding, Back Translation) but does not present structured pseudocode or algorithm blocks; a hedged illustration of the denoising corruption step is sketched after this table. |
| Open Source Code | Yes | Code: https://coderosetta.com |
| Open Datasets | Yes | For the C++ to CUDA translation task, we use the dataset from BabelTower [46]... We extract the C++ and Fortran subsets from The Stack v2 dataset [25]... For fine-tuning, we use the small paired C++-Fortran dataset introduced by Bin et al. [19]. |
| Dataset Splits | Yes | Paired validation and test sets: The validation set consists of 184 pairs, and the test set has 180 pairs of C++ and CUDA source code files. For fine-tuning, we use the small paired C++-Fortran dataset introduced by Bin et al. [19]. This set is also used for validation. |
| Hardware Specification | Yes | The experiments were run on a single node with four Nvidia A100 SXM4 GPUs, each with 80GB of memory. |
| Software Dependencies | Yes | We implement CodeRosetta using the Hugging Face Transformers library v4.40.1 [47]. |
| Experiment Setup | Yes | The model is a 12-layer encoder-decoder transformer, with each layer having 12 attention heads and a hidden dimension of 1,536... The training was conducted using the AdamW optimizer [24] and a batch size of 16, using gradient accumulation over two steps. For Masked Language Modeling (MLM) training, we use a learning rate of 8×10⁻⁵ and train for 100 epochs with 15% masking... For Abstract Syntax Tree (AST) entity recognition, we use a learning rate of 5×10⁻⁶ and train for ten epochs... For Denoising Auto-Encoding and Back Translation, we use a learning rate of 5×10⁻⁵ and train for 20 epochs. For Denoising Auto-Encoding, we set the masking to 15%, token dropping to 25%, and token insertion to 15%, with a denoising ratio increasing by 2.5% per epoch. Finally, for fine-tuning, we use a learning rate of 5×10⁻⁵ for ten epochs. These values are gathered into a configuration sketch after this table. |
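
For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The dictionary below is only a convenience sketch; the stage names (`mlm`, `aer`, `dae_bt`, `finetune`) and key layout are ours, not the authors' actual configuration schema.

```python
# Hyperparameters as reported in the Experiment Setup row above, gathered in one place.
# This is a plain-dictionary sketch for readability, not the authors' configuration format.
CODEROSETTA_TRAINING_CONFIG = {
    "model": {"layers": 12, "attention_heads": 12, "hidden_dim": 1536},
    "optimizer": "AdamW",
    "batch_size": 16,
    "gradient_accumulation_steps": 2,
    "stages": {
        "mlm": {"lr": 8e-5, "epochs": 100, "mask_ratio": 0.15},
        "aer": {"lr": 5e-6, "epochs": 10},  # AST entity recognition
        "dae_bt": {  # Denoising Auto-Encoding + Back Translation
            "lr": 5e-5,
            "epochs": 20,
            "dae_noise": {
                "mask": 0.15,
                "drop": 0.25,
                "insert": 0.15,
                "ratio_increase_per_epoch": 0.025,
            },
        },
        "finetune": {"lr": 5e-5, "epochs": 10},
    },
}
```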
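
The Pseudocode row notes that the paper shows its pretraining objectives only as figures. As a rough illustration of the Denoising Auto-Encoding corruption step (token masking, dropping, and insertion at the 15% / 25% / 15% ratios reported above), the Python sketch below noises a token sequence that the model would then learn to reconstruct. It is a minimal sketch under assumptions: the function name `corrupt_for_dae`, the `<mask>` token string, and the sampling details are illustrative, not taken from the paper or its code.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder mask symbol; the real tokenizer's mask token may differ

def corrupt_for_dae(tokens, mask_ratio=0.15, drop_ratio=0.25, insert_ratio=0.15, vocab=None):
    """Apply the three DAE corruptions (drop, mask, insert) to a token sequence.

    The denoising auto-encoder is trained to reconstruct the original
    sequence from the corrupted one returned here.
    """
    vocab = vocab or tokens  # fall back to sampling insertions from the sequence itself
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < drop_ratio:
            continue                      # token dropping
        elif r < drop_ratio + mask_ratio:
            noisy.append(MASK_TOKEN)      # token masking
        else:
            noisy.append(tok)             # keep the token unchanged
        if random.random() < insert_ratio:
            noisy.append(random.choice(vocab))  # random token insertion
    return noisy

# Example: corrupt a toy CUDA-like token stream
print(corrupt_for_dae("__global__ void add ( float * a , float * b )".split()))
```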