Language-Agnostic Representation Learning of Source Code from Structure and Context

Authors: Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use two datasets comprising 5 different programming languages in total, and evaluate the representations learned by our model on the task of code summarization, where the model predicts a method's name based on its body. Besides setting the state-of-the-art on all five languages for single-language training, we also train the first multilingual model for code summarization. (An illustrative sample sketch follows after this table.)
Researcher Affiliation | Collaboration | Daniel Zügner, Tobias Kirschstein (Technical University of Munich, {zuegnerd,kirschto}@in.tum.de); Michele Catasta (Stanford University, pirroh@cs.stanford.edu); Jure Leskovec (Stanford University, jure@cs.stanford.edu); Stephan Günnemann (Technical University of Munich, guennemann@in.tum.de)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code at www.daml.in.tum.de/code-transformer, demo at code-transformer.org.
Open Datasets | Yes | To highlight the benefit of only relying on language-agnostic representations such as source code and abstract syntax trees, we evaluate on challenging datasets in four programming languages introduced in the CodeSearchNet (CSN) Challenge (Husain et al., 2019): Python, Javascript, Go, and Ruby. We further evaluate on Java-small (Allamanis et al., 2016), a popular and challenging code summarization dataset. It contains 11 open-source Java projects. We use the split as in Alon et al. (2019a), where 9 of these projects are used for training, one for validation, and one for test.
Dataset Splits | Yes | We use the split as in Alon et al. (2019a), where 9 of these projects are used for training, one for validation, and one for test. Table 1 (dataset statistics, samples per partition): CSN-Python: 412,178 train / 23,107 val. / 22,176 test; CSN-Javascript: 123,889 / 8,253 / 6,483; CSN-Ruby: 48,791 / 2,209 / 2,279; CSN-Go: 317,832 / 14,242 / 14,291; Java-small: 691,974 / 23,844 / 57,088.
Hardware Specification | No | The paper mentions experimental setup details such as optimizer, learning rate, and batch size, but does not specify any hardware components like GPU or CPU models.
Software Dependencies | No | The paper mentions tools such as the Pygments language-specific tokenizer, Semantic, and java-parser, but it does not specify any version numbers for these software dependencies. (A brief Pygments tokenization sketch follows after this table.)
Experiment Setup | Yes | Table 7: Code Summarization hyperparameters: activation GELU; num. layers 3; d 1024; d_FF 2048; p_dropout 0.2; num. heads 8. For all our experiments, we use a Transformer decoder with one layer and teacher forcing to generate 6 output sub-tokens. We also employ label smoothing of 0.1. As optimizer, we use Adam with a learning rate of 8e-5 and weight decay of 3e-5. Batch size during training is 8, with a simulated batch size of 128 achieved by gradient accumulation. (A minimal configuration sketch follows after this table.)
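
For readers unfamiliar with the task referenced under "Research Type", the following minimal Python sketch illustrates what a code summarization sample looks like. The method body, variable names, and sub-token split are invented for illustration and are not taken from the paper's datasets.

    # Hypothetical code summarization sample: the model receives a method body
    # with its name masked out and must predict the name's sub-tokens.
    sample_body = '''
    def __METHOD_NAME__(values):
        total = 0
        for v in values:
            total += v
        return total
    '''

    # Target output: the masked method name split into sub-tokens,
    # e.g. "sum_values" -> ["sum", "values"]. The paper's decoder generates
    # 6 output sub-tokens with teacher forcing during training.
    target_subtokens = ["sum", "values"]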
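The "Software Dependencies" row mentions a Pygments language-specific tokenizer. As a hedged sketch (the authors' preprocessing pipeline may use Pygments differently), this is the standard way to obtain language-specific tokens with Pygments:

    # Sketch: tokenizing a snippet with a language-specific Pygments lexer.
    # This only illustrates the library named in the paper; it is not the
    # authors' preprocessing code.
    from pygments.lexers import get_lexer_by_name

    code = "def add(a, b):\n    return a + b\n"
    lexer = get_lexer_by_name("python")

    # get_tokens yields (token_type, value) pairs, e.g. (Token.Keyword, 'def').
    tokens = list(lexer.get_tokens(code))
    print(tokens[:5])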
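The hyperparameters listed under "Experiment Setup" translate into roughly the following PyTorch configuration. This is a minimal sketch that mirrors only the reported sizes and optimizer settings; it does not reproduce the CodeTransformer's structure-aware attention, and the names encoder/decoder are illustrative.

    import torch
    from torch import nn

    # Encoder per Table 7: 3 layers, d = 1024, d_FF = 2048, 8 heads,
    # dropout 0.2, GELU activation.
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=1024, nhead=8, dim_feedforward=2048,
        dropout=0.2, activation="gelu",
    )
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

    # Decoder: a single Transformer decoder layer, as stated in the paper.
    decoder_layer = nn.TransformerDecoderLayer(
        d_model=1024, nhead=8, dim_feedforward=2048,
        dropout=0.2, activation="gelu",
    )
    decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)

    # Adam with learning rate 8e-5 and weight decay 3e-5; label smoothing 0.1
    # (CrossEntropyLoss's label_smoothing argument requires PyTorch >= 1.10).
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=8e-5, weight_decay=3e-5)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    # Micro-batches of 8 with gradient accumulation to simulate a batch of 128.
    accumulation_steps = 128 // 8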