Learning and Evaluating Contextual Embedding of Source Code

Authors: Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples.
Researcher Affiliation | Collaboration | 1: Indian Institute of Science, Bangalore, India; 2: Google Brain, Mountain View, USA. Correspondence to: Aditya Kanade <kanade@iisc.ac.in>, Petros Maniatis <maniatis@google.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the models and experimental procedures in narrative text and tables.
Open Source Code | Yes | We make the models and datasets publicly available. https://github.com/google-research/google-research/tree/master/cubert
Open Datasets | Yes | First, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task... We use the ETH Py150 corpus (Raychev et al., 2016) to generate datasets for the fine-tuning tasks... We call the resulting corpus the ETH Py150 Open corpus. https://github.com/google-research-datasets/eth_py150_open
Dataset Splits | Yes | This corpus consists of 150K Python files from GitHub, and is partitioned into a training split (100K files) and a test split (50K files). We held out 10K files from the training split as a validation split. ... This is our Python fine-tuning code corpus, and it consists of 74,749 training files, 8,302 validation files, and 41,457 test files. One possible way to derive such a held-out validation split is sketched below, after the table.
Hardware Specification | Yes | We used TPUs for training our models, except for pre-training Word2Vec embeddings, and the pointer model by Vasic et al. (2019). For the rest, and for all evaluations, we used P100 or V100 GPUs.
Software Dependencies | No | The paper mentions using the "standard Python tokenizer (the tokenize package)", "Gensim (Řehůřek & Sojka, 2010)", and the "SubwordTextEncoder from the Tensor2Tensor project (Vaswani et al., 2018)". However, it does not specify version numbers for Python, tokenize, Gensim, or Tensor2Tensor, which would be needed to reproduce the software environment. A minimal sketch of the token-level stage of this pipeline appears below the table.
Experiment Setup | Yes | We pre-train CuBERT with the default configuration of the BERT Large model, one model per example length (128, 256, 512, and 1,024 subword tokens) with batch sizes of 8,192, 4,096, 2,048, and 1,024 respectively, and the default BERT learning rate of 1 × 10⁻⁴. Fine-tuned models also used the same batch sizes as for pre-training, and BERT's default learning rate (5 × 10⁻⁵). For both, we gradually warm up the learning rate for the first 10% of examples, which is BERT's default value.
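
On the dataset splits: the paper states that 10K of the 100K ETH Py150 training files were held out for validation, but not how those files were chosen. The sketch below is a hypothetical, reproducible way to carve out roughly 10% by hashing file paths; the manifest file name and the hash-bucket scheme are assumptions, not the authors' procedure, and it will not by itself reproduce the 74,749/8,302/41,457 counts, which also reflect deduplication against the pre-training corpus.

```python
# Hypothetical sketch: deriving a ~10% held-out validation split from the
# ETH Py150 training manifest by hashing file paths. The paper does not
# describe how its 10K validation files were selected; this is just one
# deterministic, reproducible choice.
import hashlib


def split_bucket(file_path, num_buckets=10):
    """Map a file path to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(file_path.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def make_splits(train_manifest):
    """Assign roughly 10% of the original training files to validation."""
    train, valid = [], []
    for path in train_manifest:
        (valid if split_bucket(path) == 0 else train).append(path)
    return train, valid


# Example usage (manifest file name may differ per dataset release):
# with open("python100k_train.txt") as f:
#     train_files, valid_files = make_splits([line.strip() for line in f])
```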
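
On the tokenization dependencies: the quoted passage names Python's standard-library tokenize package as the first stage, followed by a Tensor2Tensor SubwordTextEncoder vocabulary. The sketch below covers only the standard-library stage, which is fully specified; the subword step is indicated in a comment because its vocabulary-building parameters are not given here.

```python
# Minimal sketch of program-token extraction with Python's standard
# `tokenize` module, the first stage of the tokenization the paper describes.
import io
import tokenize


def program_tokens(source):
    """Yield (token_kind, token_text) pairs for a Python source string."""
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        yield tokenize.tok_name[tok.type], tok.string


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(list(program_tokens(snippet)))
    # A subword vocabulary (e.g. Tensor2Tensor's SubwordTextEncoder) would
    # then be trained over these program tokens to produce the subword
    # sequences fed to the model.
```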
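
On the experiment setup: the quote gives peak learning rates (1 × 10⁻⁴ for pre-training, 5 × 10⁻⁵ for fine-tuning) and a linear warmup over the first 10% of examples. The sketch below shows that schedule; the linear decay after warmup is BERT's common default and is an assumption here, since the quote does not spell it out.

```python
# Sketch of a BERT-style schedule: linear warmup to the peak learning rate
# over the first 10% of steps, then (assumed) linear decay to zero.
def learning_rate(step, total_steps, peak_lr=1e-4, warmup_fraction=0.10):
    """Return the learning rate at a given training step."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Pre-training uses peak_lr = 1e-4; fine-tuning uses BERT's default 5e-5.
```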