DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Authors: Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, Guillaume Lample

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
Researcher Affiliation | Collaboration | Marie-Anne Lachaux (Facebook AI Research, malachaux@fb.com); Baptiste Roziere (Facebook AI Research and Paris-Dauphine University, broz@fb.com); Marc Szafraniec (Facebook AI Research, szafraniec@fb.com); Guillaume Lample (Facebook AI Research, glample@fb.com)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper states: 'The data, code and models from CodeXGLUE and TransCoder are available respectively under the MIT and the Creative Commons license.' This refers to third-party code and datasets, not the authors' own implementation code for DOBF.
Open Datasets | Yes | As in Roziere et al. [2020], we use the GitHub public dataset available on Google BigQuery and select all Python and Java files within the projects with licenses authorizing use for research purposes. (An illustrative BigQuery selection sketch follows the table.)
Dataset Splits | Yes | We only consider the Java and Python tasks with an encoder in the model architecture for which the training, validation, and test sets are publicly available. ... We compute the overall subtoken precision, recall and F1 scores averaged over each file in our validation and test datasets. (A subtoken-metric sketch follows the table.)
Hardware Specification | Yes | We implement our models in PyTorch Paszke et al. [2019] and train them on 32 V100 GPUs for eight days. ... For TransCoder, we use a learning rate of 10^-4 as in Roziere et al. [2020] and we train the models for 2 days on 32 Tesla V100 GPUs.
Software Dependencies | No | The paper states: 'We implement our models in PyTorch Paszke et al. [2019]' but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We train models with the same architecture and tokenizer as CodeBERT Feng et al. [2020] and GraphCodeBERT Guo et al. [2020] in order to provide fair comparisons: 12 layers, 12 attention heads and a hidden dimension of 768. ... We optimize DOBF with the Adam optimizer Kingma and Ba [2014] and an inverse square-root learning rate scheduler Vaswani et al. [2017]. ... We train DOBF with three different obfuscation probability parameters: p_obf ∈ {0, 0.5, 1}. ... We perform a grid search on five learning rate parameters ranging from 5·10^-6 to 10^-4 and we select the best parameter on the validation dataset. (Scheduler and obfuscation sketches follow the table.)
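
The Open Datasets row mentions selecting Python and Java files from the GitHub public dataset on Google BigQuery. Below is a minimal, illustrative sketch of such a selection; the table and column names come from the public `bigquery-public-data.github_repos` dataset, and the license filter shown is a hypothetical research-friendly subset, not the authors' exact criteria.

```python
# Sketch: pull Python/Java files from the GitHub public dataset on BigQuery.
# Uses the sample_* tables to keep the query cheap; the license whitelist is
# a hypothetical stand-in for "licenses authorizing use for research purposes".
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT f.repo_name, f.path, c.content
FROM `bigquery-public-data.github_repos.sample_files` AS f
JOIN `bigquery-public-data.github_repos.sample_contents` AS c ON f.id = c.id
JOIN `bigquery-public-data.github_repos.licenses` AS l ON f.repo_name = l.repo_name
WHERE (f.path LIKE '%.py' OR f.path LIKE '%.java')
  AND l.license IN ('mit', 'apache-2.0', 'bsd-3-clause')  -- hypothetical subset
  AND NOT c.binary
"""

for row in client.query(QUERY).result():
    print(row.repo_name, row.path, len(row.content or ""))
```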
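
The Dataset Splits row quotes the evaluation metric: subtoken precision, recall, and F1, averaged over files. The sketch below shows one plausible way to compute it; the exact subtoken splitting rules (snake_case and camelCase splits) and per-file averaging details are assumptions, not the paper's reference implementation.

```python
# Sketch: subtoken precision/recall/F1 for predicted identifier names.
import re
from collections import Counter

def subtokens(name: str) -> list[str]:
    # Split snake_case, then camelCase, and lowercase everything.
    parts = []
    for chunk in name.split("_"):
        parts += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", chunk)
    return [p.lower() for p in parts if p]

def file_prf(predicted: dict[str, str], reference: dict[str, str]):
    """Scores one file: predicted/reference map each obfuscated slot to a name."""
    tp = fp = fn = 0
    for key, ref_name in reference.items():
        ref = Counter(subtokens(ref_name))
        pred = Counter(subtokens(predicted.get(key, "")))
        overlap = sum((ref & pred).values())
        tp += overlap
        fp += sum(pred.values()) - overlap
        fn += sum(ref.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Per the quoted setup, scores are computed per file, then averaged over the
# validation/test files.
```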
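
The Experiment Setup row cites Adam with an inverse square-root learning rate scheduler (Vaswani et al. [2017]). A minimal PyTorch sketch of that schedule is below; the warmup length, peak learning rate, and the stand-in model are illustrative placeholders, not the paper's exact values.

```python
# Sketch: Adam + inverse square-root LR schedule with linear warmup.
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the 12-layer, 768-dim transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps = 10_000  # illustrative value

def inv_sqrt(step: int) -> float:
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps        # linear warmup to the peak LR
    return (warmup_steps / step) ** 0.5   # then decay proportional to 1/sqrt(step)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt)

for step in range(100):
    optimizer.step()   # after the usual forward/backward pass
    scheduler.step()
```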
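
The same row lists the obfuscation probability p_obf ∈ {0, 0.5, 1}. The sketch below illustrates the idea behind the deobfuscation objective: identifiers are replaced by uninformative placeholders with probability p_obf, and the training target is the dictionary recovering the original names. The identifier extraction, the placeholder format (VAR_0, VAR_1, ...), and the regex-based renaming are simplified assumptions rather than the authors' implementation.

```python
# Sketch: obfuscate identifiers with probability p_obf and build the
# deobfuscation target (placeholder -> original name).
import random
import re

def obfuscate(code: str, identifiers: list[str], p_obf: float, seed: int = 0):
    rng = random.Random(seed)
    mapping = {}  # placeholder -> original name (what the model must recover)
    for name in identifiers:
        if rng.random() < p_obf:
            placeholder = f"VAR_{len(mapping)}"
            mapping[placeholder] = name
            code = re.sub(rf"\b{re.escape(name)}\b", placeholder, code)
    return code, mapping

snippet = "def add_item(item_list, new_item):\n    item_list.append(new_item)\n"
obf_code, target = obfuscate(snippet, ["add_item", "item_list", "new_item"], p_obf=0.5)
print(obf_code)
print(target)  # e.g. {'VAR_0': 'new_item'}, depending on the random draw
```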