Leveraging Automated Unit Tests for Unsupervised Code Translation
Authors: Baptiste Rozière, Jie M. Zhang, François Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Baptiste Rozière, Facebook AI Research & Paris-Dauphine University, broz@fb.com; Jie M. Zhang, University College London, zhangjie@fb.com; François Charton, Facebook AI Research, fcharton@fb.com; Mark Harman, Facebook, markharman@fb.com; Gabriel Synnaeve, Facebook AI Research, gab@fb.com; Guillaume Lample, Facebook AI Research, glample@fb.com |
| Pseudocode | No | The paper describes its methods in narrative text and with diagrams but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We submit our code with this submission, along with a ReadMe file detailing clear steps to reproduce our results, including a script to set up a suitable environment. We will open-source our code and release our trained models. |
| Open Datasets | Yes | Datasets. As TransCoder and DOBF, we use the GitHub public dataset available on Google BigQuery, filtered to keep only projects with open-source licenses. |
| Dataset Splits | Yes | We evaluate our models on the full validation and test sets of TransCoder. |
| Hardware Specification | Yes | Our models were trained using standard hardware (Tesla V100 GPUs) and libraries (e.g. Pytorch, Cuda) for machine-learning research. |
| Software Dependencies | No | The paper mentions 'Pytorch, Cuda' as the libraries used but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For the online version, we set a cache warm-up parameter to ensure that we always generate new parallel examples if there are fewer than 500 examples in the cache for any language pair. Otherwise, we sample from the cache with probability 0.5, or create new parallel functions to add to the cache. When an example is sampled, it is removed from the cache with probability 0.3, so that each element we create is trained on about 4 times on average before being removed from the cache. We initialize the cache with parallel examples created offline. During beam decoding, we compute the score of generated sequences by dividing the sum of token log-probabilities by l^α, where l is the sequence length. We found that taking α = 0.5 (and penalizing long generations) leads to the best performance on the validation set. |
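
The cache schedule and the length-penalized beam score quoted in the Experiment Setup row are simple to express in code. The sketch below is a minimal Python illustration, assuming a plain in-memory list as the per-language-pair cache; the class and function names (`ParallelExampleCache`, `length_penalized_score`) are hypothetical and are not taken from the authors' released implementation.

```python
import random


class ParallelExampleCache:
    """Minimal sketch of the online cache of generated parallel examples
    (one cache per language pair). Names and data structures are
    illustrative assumptions, not the authors' code."""

    def __init__(self, warmup_size=500, sample_prob=0.5, removal_prob=0.3):
        self.warmup_size = warmup_size    # always generate new pairs below this size
        self.sample_prob = sample_prob    # probability of reusing a cached pair otherwise
        self.removal_prob = removal_prob  # probability of evicting a pair once sampled
        self.examples = []                # cached (source, target) function pairs

    def should_generate_new(self):
        """Decide whether to create a fresh parallel example this step."""
        if len(self.examples) < self.warmup_size:
            return True                   # warm-up: cache too small, generate new pairs
        return random.random() >= self.sample_prob

    def add(self, example):
        """Store a newly created parallel example (e.g. one validated by unit tests)."""
        self.examples.append(example)

    def sample(self):
        """Draw a cached example, evicting it with probability `removal_prob`."""
        if not self.examples:
            raise IndexError("cache is empty; call add() first")
        idx = random.randrange(len(self.examples))
        example = self.examples[idx]
        if random.random() < self.removal_prob:
            self.examples.pop(idx)
        return example


def length_penalized_score(token_log_probs, alpha=0.5):
    """Beam-search score: sum of token log-probabilities divided by l**alpha,
    where l is the generated sequence length (alpha = 0.5 in the paper)."""
    length = len(token_log_probs)
    return sum(token_log_probs) / (length ** alpha)
```

In a training loop, one would call `should_generate_new()` at each step to decide between running the unit-test pipeline and adding a fresh pair with `add()`, or reusing a cached pair via `sample()`. With a removal probability of 0.3, a cached pair is drawn roughly 1/0.3 ≈ 3-4 times before eviction, in line with the quoted "about 4 times on average".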