Learning and Evaluating Contextual Embedding of Source Code

Authors: Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples.
Researcher Affiliation | Collaboration | 1: Indian Institute of Science, Bangalore, India; 2: Google Brain, Mountain View, USA. Correspondence to: Aditya Kanade <kanade@iisc.ac.in>, Petros Maniatis <maniatis@google.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the models and experimental procedures in narrative text and tables.
Open Source Code | Yes | We make the models and datasets publicly available. https://github.com/google-research/google-research/tree/master/cubert
Open Datasets | Yes | First, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task... We use the ETH Py150 corpus (Raychev et al., 2016) to generate datasets for the fine-tuning tasks... We call the resulting corpus the ETH Py150 Open corpus. https://github.com/google-research-datasets/eth_py150_open
Dataset Splits | Yes | This corpus consists of 150K Python files from GitHub, and is partitioned into a training split (100K files) and a test split (50K files). We held out 10K files from the training split as a validation split. ... This is our Python fine-tuning code corpus, and it consists of 74,749 training files, 8,302 validation files, and 41,457 test files. One possible way to derive such a held-out validation split is sketched below, after the table.
Hardware Specification | Yes | We used TPUs for training our models, except for pre-training Word2Vec embeddings, and the pointer model by Vasic et al. (2019). For the rest, and for all evaluations, we used P100 or V100 GPUs.
Software Dependencies | No | The paper mentions using the "standard Python tokenizer (the tokenize package)", "Gensim (Řehůřek & Sojka, 2010)", and the "SubwordTextEncoder from the Tensor2Tensor project (Vaswani et al., 2018)". However, it does not specify version numbers for Python, tokenize, Gensim, or Tensor2Tensor, which would be needed to reproduce the software environment. A minimal sketch of the token-level stage of this pipeline appears below the table.
Experiment Setup | Yes | We pre-train CuBERT with the default configuration of the BERT Large model, one model per example length (128, 256, 512, and 1,024 subword tokens) with batch sizes of 8,192, 4,096, 2,048, and 1,024 respectively, and the default BERT learning rate of 1 × 10⁻⁴. Fine-tuned models also used the same batch sizes as for pre-training, and BERT's default learning rate (5 × 10⁻⁵). For both, we gradually warm up the learning rate for the first 10% of examples, which is BERT's default value.
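
On the dataset splits: the paper states that 10K of the 100K ETH Py150 training files were held out for validation, but not how those files were chosen. The sketch below is a hypothetical, reproducible way to carve out roughly 10% by hashing file paths; the manifest file name and the hash-bucket scheme are assumptions, not the authors' procedure, and it will not by itself reproduce the 74,749/8,302/41,457 counts, which also reflect deduplication against the pre-training corpus.

```python
# Hypothetical sketch: deriving a ~10% held-out validation split from the
# ETH Py150 training manifest by hashing file paths. The paper does not
# describe how its 10K validation files were selected; this is just one
# deterministic, reproducible choice.
import hashlib


def split_bucket(file_path, num_buckets=10):
    """Map a file path to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(file_path.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def make_splits(train_manifest):
    """Assign roughly 10% of the original training files to validation."""
    train, valid = [], []
    for path in train_manifest:
        (valid if split_bucket(path) == 0 else train).append(path)
    return train, valid


# Example usage (manifest file name may differ per dataset release):
# with open("python100k_train.txt") as f:
#     train_files, valid_files = make_splits([line.strip() for line in f])
```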
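
On the tokenization dependencies: the quoted passage names Python's standard-library tokenize package as the first stage, followed by a Tensor2Tensor SubwordTextEncoder vocabulary. The sketch below covers only the standard-library stage, which is fully specified; the subword step is indicated in a comment because its vocabulary-building parameters are not given here.

```python
# Minimal sketch of program-token extraction with Python's standard
# `tokenize` module, the first stage of the tokenization the paper describes.
import io
import tokenize


def program_tokens(source):
    """Yield (token_kind, token_text) pairs for a Python source string."""
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        yield tokenize.tok_name[tok.type], tok.string


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(list(program_tokens(snippet)))
    # A subword vocabulary (e.g. Tensor2Tensor's SubwordTextEncoder) would
    # then be trained over these program tokens to produce the subword
    # sequences fed to the model.
```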
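
On the experiment setup: the quote gives peak learning rates (1 × 10⁻⁴ for pre-training, 5 × 10⁻⁵ for fine-tuning) and a linear warmup over the first 10% of examples. The sketch below shows that schedule; the linear decay after warmup is BERT's common default and is an assumption here, since the quote does not spell it out.

```python
# Sketch of a BERT-style schedule: linear warmup to the peak learning rate
# over the first 10% of steps, then (assumed) linear decay to zero.
def learning_rate(step, total_steps, peak_lr=1e-4, warmup_fraction=0.10):
    """Return the learning rate at a given training step."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Pre-training uses peak_lr = 1e-4; fine-tuning uses BERT's default 5e-5.
```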