Leveraging Automated Unit Tests for Unsupervised Code Translation
Authors: Baptiste Rozière, Jie M. Zhang, François Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Baptiste Rozière, Facebook AI Research & Paris-Dauphine University, broz@fb.com; Jie M. Zhang, University College London, zhangjie@fb.com; François Charton, Facebook AI Research, fcharton@fb.com; Mark Harman, Facebook, markharman@fb.com; Gabriel Synnaeve, Facebook AI Research, gab@fb.com; Guillaume Lample, Facebook AI Research, glample@fb.com |
| Pseudocode | No | The paper describes its methods in narrative text and with diagrams but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We submit our code with this submission, along with a ReadMe file detailing clear steps to reproduce our results, including a script to set up a suitable environment. We will open-source our code and release our trained models. |
| Open Datasets | Yes | Datasets. As TransCoder and DOBF, we use the GitHub public dataset available on Google BigQuery, filtered to keep only projects with open-source licenses. |
| Dataset Splits | Yes | We evaluate our models on the full validation and test sets of TransCoder. |
| Hardware Specification | Yes | Our models were trained using standard hardware (Tesla V100 GPUs) and libraries (e.g. Pytorch, Cuda) for machine-learning research. |
| Software Dependencies | No | The paper mentions 'Pytorch, Cuda' as the libraries used but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For the online version, we set a cache warm-up parameter to ensure that we always generate new parallel examples if there are fewer than 500 examples in the cache for any language pair. Otherwise, we sample from the cache with probability 0.5, or create new parallel functions to add to the cache. When an example is sampled, it is removed from the cache with probability 0.3, so that each element we create is trained on about 4 times on average before being removed from the cache. We initialize the cache with parallel examples created offline. During beam decoding, we compute the score of generated sequences by dividing the sum of token log-probabilities by l^α, where l is the sequence length. We found that taking α = 0.5 (and penalizing long generations) leads to the best performance on the validation set. |
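
The cache schedule and the length-penalized beam score quoted in the Experiment Setup row are simple to express in code. The sketch below is a minimal Python illustration, assuming a plain in-memory list as the per-language-pair cache; the class and function names (`ParallelExampleCache`, `length_penalized_score`) are hypothetical and are not taken from the authors' released implementation.

```python
import random


class ParallelExampleCache:
    """Minimal sketch of the online cache of generated parallel examples
    (one cache per language pair). Names and data structures are
    illustrative assumptions, not the authors' code."""

    def __init__(self, warmup_size=500, sample_prob=0.5, removal_prob=0.3):
        self.warmup_size = warmup_size    # always generate new pairs below this size
        self.sample_prob = sample_prob    # probability of reusing a cached pair otherwise
        self.removal_prob = removal_prob  # probability of evicting a pair once sampled
        self.examples = []                # cached (source, target) function pairs

    def should_generate_new(self):
        """Decide whether to create a fresh parallel example this step."""
        if len(self.examples) < self.warmup_size:
            return True                   # warm-up: cache too small, generate new pairs
        return random.random() >= self.sample_prob

    def add(self, example):
        """Store a newly created parallel example (e.g. one validated by unit tests)."""
        self.examples.append(example)

    def sample(self):
        """Draw a cached example, evicting it with probability `removal_prob`."""
        if not self.examples:
            raise IndexError("cache is empty; call add() first")
        idx = random.randrange(len(self.examples))
        example = self.examples[idx]
        if random.random() < self.removal_prob:
            self.examples.pop(idx)
        return example


def length_penalized_score(token_log_probs, alpha=0.5):
    """Beam-search score: sum of token log-probabilities divided by l**alpha,
    where l is the generated sequence length (alpha = 0.5 in the paper)."""
    length = len(token_log_probs)
    return sum(token_log_probs) / (length ** alpha)
```

In a training loop, one would call `should_generate_new()` at each step to decide between running the unit-test pipeline and adding a fresh pair with `add()`, or reusing a cached pair via `sample()`. With a removal probability of 0.3, a cached pair is drawn roughly 1/0.3 ≈ 3-4 times before eviction, in line with the quoted "about 4 times on average".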