Language-Agnostic Representation Learning of Source Code from Structure and Context

Authors: Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use two datasets comprising 5 different programming languages in total, and evaluate the representations learned by our model on the task of code summarization, where the model predicts a method's name based on its body. Besides setting the state-of-the-art on all five languages for single-language training, we also train the first multilingual model for code summarization. (An illustrative sample sketch follows after this table.)
Researcher Affiliation | Collaboration | Daniel Zügner, Tobias Kirschstein (Technical University of Munich, {zuegnerd,kirschto}@in.tum.de); Michele Catasta (Stanford University, pirroh@cs.stanford.edu); Jure Leskovec (Stanford University, jure@cs.stanford.edu); Stephan Günnemann (Technical University of Munich, guennemann@in.tum.de)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code at www.daml.in.tum.de/code-transformer, demo at code-transformer.org.
Open Datasets | Yes | To highlight the benefit of only relying on language-agnostic representations such as source code and abstract syntax trees, we evaluate on challenging datasets in four programming languages introduced in the CodeSearchNet (CSN) Challenge (Husain et al., 2019): Python, Javascript, Go, and Ruby. We further evaluate on Java-small (Allamanis et al., 2016), a popular and challenging code summarization dataset. It contains 11 open-source Java projects. We use the split as in Alon et al. (2019a), where 9 of these projects are used for training, one for validation, and one for test.
Dataset Splits | Yes | We use the split as in Alon et al. (2019a), where 9 of these projects are used for training, one for validation, and one for test. Table 1 (dataset statistics, samples per partition): CSN-Python: 412,178 train / 23,107 val. / 22,176 test; CSN-Javascript: 123,889 / 8,253 / 6,483; CSN-Ruby: 48,791 / 2,209 / 2,279; CSN-Go: 317,832 / 14,242 / 14,291; Java-small: 691,974 / 23,844 / 57,088.
Hardware Specification | No | The paper mentions experimental setup details such as optimizer, learning rate, and batch size, but does not specify any hardware components like GPU or CPU models.
Software Dependencies | No | The paper mentions tools such as the Pygments language-specific tokenizer, Semantic, and java-parser, but it does not specify any version numbers for these software dependencies. (A brief Pygments tokenization sketch follows after this table.)
Experiment Setup | Yes | Table 7: Code Summarization hyperparameters: activation GELU; num. layers 3; d 1024; d_FF 2048; p_dropout 0.2; num. heads 8. For all our experiments, we use a Transformer decoder with one layer and teacher forcing to generate 6 output sub-tokens. We also employ label smoothing of 0.1. As optimizer, we use Adam with a learning rate of 8e-5 and weight decay of 3e-5. Batch size during training is 8, with a simulated batch size of 128 achieved by gradient accumulation. (A minimal configuration sketch follows after this table.)
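
For readers unfamiliar with the task referenced under "Research Type", the following minimal Python sketch illustrates what a code summarization sample looks like. The method body, variable names, and sub-token split are invented for illustration and are not taken from the paper's datasets.

    # Hypothetical code summarization sample: the model receives a method body
    # with its name masked out and must predict the name's sub-tokens.
    sample_body = '''
    def __METHOD_NAME__(values):
        total = 0
        for v in values:
            total += v
        return total
    '''

    # Target output: the masked method name split into sub-tokens,
    # e.g. "sum_values" -> ["sum", "values"]. The paper's decoder generates
    # 6 output sub-tokens with teacher forcing during training.
    target_subtokens = ["sum", "values"]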
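The "Software Dependencies" row mentions a Pygments language-specific tokenizer. As a hedged sketch (the authors' preprocessing pipeline may use Pygments differently), this is the standard way to obtain language-specific tokens with Pygments:

    # Sketch: tokenizing a snippet with a language-specific Pygments lexer.
    # This only illustrates the library named in the paper; it is not the
    # authors' preprocessing code.
    from pygments.lexers import get_lexer_by_name

    code = "def add(a, b):\n    return a + b\n"
    lexer = get_lexer_by_name("python")

    # get_tokens yields (token_type, value) pairs, e.g. (Token.Keyword, 'def').
    tokens = list(lexer.get_tokens(code))
    print(tokens[:5])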
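The hyperparameters listed under "Experiment Setup" translate into roughly the following PyTorch configuration. This is a minimal sketch that mirrors only the reported sizes and optimizer settings; it does not reproduce the CodeTransformer's structure-aware attention, and the names encoder/decoder are illustrative.

    import torch
    from torch import nn

    # Encoder per Table 7: 3 layers, d = 1024, d_FF = 2048, 8 heads,
    # dropout 0.2, GELU activation.
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=1024, nhead=8, dim_feedforward=2048,
        dropout=0.2, activation="gelu",
    )
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

    # Decoder: a single Transformer decoder layer, as stated in the paper.
    decoder_layer = nn.TransformerDecoderLayer(
        d_model=1024, nhead=8, dim_feedforward=2048,
        dropout=0.2, activation="gelu",
    )
    decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)

    # Adam with learning rate 8e-5 and weight decay 3e-5; label smoothing 0.1
    # (CrossEntropyLoss's label_smoothing argument requires PyTorch >= 1.10).
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=8e-5, weight_decay=3e-5)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    # Micro-batches of 8 with gradient accumulation to simulate a batch of 128.
    accumulation_steps = 128 // 8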