DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Authors: Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, Guillaume Lample

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
Researcher Affiliation | Collaboration | Marie-Anne Lachaux (Facebook AI Research, malachaux@fb.com); Baptiste Roziere (Facebook AI Research and Paris-Dauphine University, broz@fb.com); Marc Szafraniec (Facebook AI Research, szafraniec@fb.com); Guillaume Lample (Facebook AI Research, glample@fb.com)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper states: 'The data, code and models from CodeXGLUE and TransCoder are available respectively under the MIT and the Creative Commons license.' This refers to third-party code and datasets, not the authors' own implementation code for DOBF.
Open Datasets | Yes | As in Roziere et al. [2020], we use the GitHub public dataset available on Google BigQuery and select all Python and Java files within the projects with licenses authorizing use for research purposes. (An illustrative BigQuery selection sketch follows the table.)
Dataset Splits | Yes | We only consider the Java and Python tasks with an encoder in the model architecture for which the training, validation, and test sets are publicly available. ... We compute the overall subtoken precision, recall and F1 scores averaged over each file in our validation and test datasets. (A subtoken-metric sketch follows the table.)
Hardware Specification | Yes | We implement our models in PyTorch Paszke et al. [2019] and train them on 32 V100 GPUs for eight days. ... For TransCoder, we use a learning rate of 10^-4 as in Roziere et al. [2020] and we train the models for 2 days on 32 Tesla V100 GPUs.
Software Dependencies | No | The paper states: 'We implement our models in PyTorch Paszke et al. [2019]' but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We train models with the same architecture and tokenizer as CodeBERT Feng et al. [2020] and GraphCodeBERT Guo et al. [2020] in order to provide fair comparisons: 12 layers, 12 attention heads and a hidden dimension of 768. ... We optimize DOBF with the Adam optimizer Kingma and Ba [2014] and an inverse square-root learning rate scheduler Vaswani et al. [2017]. ... We train DOBF with three different obfuscation probability parameters: p_obf ∈ {0, 0.5, 1}. ... We perform a grid search on five learning rate parameters ranging from 5·10^-6 to 10^-4 and we select the best parameter on the validation dataset. (Scheduler and obfuscation sketches follow the table.)
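
The Open Datasets row mentions selecting Python and Java files from the GitHub public dataset on Google BigQuery. Below is a minimal, illustrative sketch of such a selection; the table and column names come from the public `bigquery-public-data.github_repos` dataset, and the license filter shown is a hypothetical research-friendly subset, not the authors' exact criteria.

```python
# Sketch: pull Python/Java files from the GitHub public dataset on BigQuery.
# Uses the sample_* tables to keep the query cheap; the license whitelist is
# a hypothetical stand-in for "licenses authorizing use for research purposes".
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT f.repo_name, f.path, c.content
FROM `bigquery-public-data.github_repos.sample_files` AS f
JOIN `bigquery-public-data.github_repos.sample_contents` AS c ON f.id = c.id
JOIN `bigquery-public-data.github_repos.licenses` AS l ON f.repo_name = l.repo_name
WHERE (f.path LIKE '%.py' OR f.path LIKE '%.java')
  AND l.license IN ('mit', 'apache-2.0', 'bsd-3-clause')  -- hypothetical subset
  AND NOT c.binary
"""

for row in client.query(QUERY).result():
    print(row.repo_name, row.path, len(row.content or ""))
```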
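
The Dataset Splits row quotes the evaluation metric: subtoken precision, recall, and F1, averaged over files. The sketch below shows one plausible way to compute it; the exact subtoken splitting rules (snake_case and camelCase splits) and per-file averaging details are assumptions, not the paper's reference implementation.

```python
# Sketch: subtoken precision/recall/F1 for predicted identifier names.
import re
from collections import Counter

def subtokens(name: str) -> list[str]:
    # Split snake_case, then camelCase, and lowercase everything.
    parts = []
    for chunk in name.split("_"):
        parts += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", chunk)
    return [p.lower() for p in parts if p]

def file_prf(predicted: dict[str, str], reference: dict[str, str]):
    """Scores one file: predicted/reference map each obfuscated slot to a name."""
    tp = fp = fn = 0
    for key, ref_name in reference.items():
        ref = Counter(subtokens(ref_name))
        pred = Counter(subtokens(predicted.get(key, "")))
        overlap = sum((ref & pred).values())
        tp += overlap
        fp += sum(pred.values()) - overlap
        fn += sum(ref.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Per the quoted setup, scores are computed per file, then averaged over the
# validation/test files.
```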
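
The Experiment Setup row cites Adam with an inverse square-root learning rate scheduler (Vaswani et al. [2017]). A minimal PyTorch sketch of that schedule is below; the warmup length, peak learning rate, and the stand-in model are illustrative placeholders, not the paper's exact values.

```python
# Sketch: Adam + inverse square-root LR schedule with linear warmup.
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the 12-layer, 768-dim transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps = 10_000  # illustrative value

def inv_sqrt(step: int) -> float:
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps        # linear warmup to the peak LR
    return (warmup_steps / step) ** 0.5   # then decay proportional to 1/sqrt(step)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt)

for step in range(100):
    optimizer.step()   # after the usual forward/backward pass
    scheduler.step()
```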
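
The same row lists the obfuscation probability p_obf ∈ {0, 0.5, 1}. The sketch below illustrates the idea behind the deobfuscation objective: identifiers are replaced by uninformative placeholders with probability p_obf, and the training target is the dictionary recovering the original names. The identifier extraction, the placeholder format (VAR_0, VAR_1, ...), and the regex-based renaming are simplified assumptions rather than the authors' implementation.

```python
# Sketch: obfuscate identifiers with probability p_obf and build the
# deobfuscation target (placeholder -> original name).
import random
import re

def obfuscate(code: str, identifiers: list[str], p_obf: float, seed: int = 0):
    rng = random.Random(seed)
    mapping = {}  # placeholder -> original name (what the model must recover)
    for name in identifiers:
        if rng.random() < p_obf:
            placeholder = f"VAR_{len(mapping)}"
            mapping[placeholder] = name
            code = re.sub(rf"\b{re.escape(name)}\b", placeholder, code)
    return code, mapping

snippet = "def add_item(item_list, new_item):\n    item_list.append(new_item)\n"
obf_code, target = obfuscate(snippet, ["add_item", "item_list", "new_item"], p_obf=0.5)
print(obf_code)
print(target)  # e.g. {'VAR_0': 'new_item'}, depending on the random draw
```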