code2seq: Generating Sequences from Structured Representations of Code

Authors: Uri Alon, Shaked Brody, Omer Levy, Eran Yahav

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to 16M examples. Our model significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models. An online demo of our model is available at http://code2seq.org. Our code, data and trained models are available at http://github.com/tech-srl/code2seq. To examine the importance of each component of the model, we conduct a thorough ablation study.
Researcher Affiliation | Collaboration | Uri Alon (Technion) urialon@cs.technion.ac.il; Shaked Brody (Technion) shakedbr@cs.technion.ac.il; Omer Levy (Facebook AI Research) omerlevy@gmail.com; Eran Yahav (Technion) yahave@cs.technion.ac.il
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Our code, data and trained models are available at http://github.com/tech-srl/code2seq.
Open Datasets | Yes | Java-med: A new dataset of the 1000 top-starred Java projects from GitHub. ... This dataset contains about 4M examples and we make it publicly available. Java-large: A new dataset of the 9500 top-starred Java projects from GitHub ... This dataset contains about 16M examples and we make it publicly available. ... We used the dataset of CodeNN (Iyer et al., 2016)...
Dataset Splits | Yes | Java-small: ... we took 9 projects for training, 1 project for validation and 1 project as our test set. Java-med: ... We randomly select 800 projects for training, 100 for validation and 100 for testing. Java-large: ... We randomly select 9000 projects for training, 250 for validation and 300 for testing. (A project-level split sketch follows the table.)
Hardware Specification | No | The paper mentions training on GPUs ("keeping training feasible in the GPU's memory") but does not specify any particular GPU models, CPU models, or other detailed hardware specifications.
Software Dependencies | No | The paper mentions using OpenNMT for baselines but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | The values of all of the parameters are initialized using the initialization heuristic of Glorot and Bengio (2010). We optimize the cross-entropy loss (Rubinstein, 1999; 2001) with a Nesterov momentum (Nesterov, 1983) of 0.95 and an initial learning rate of 0.01, decayed by a factor of 0.95 every epoch. For the Code Summarization task, we apply dropout (Srivastava et al., 2014) of 0.25 on the input vectors x_j, and 0.7 for the Code Captioning task because of the smaller number of examples in the C# dataset. We apply a recurrent dropout of 0.5 on the LSTM that encodes the AST paths. We used d_tokens = d_nodes = d_hidden = d_target = 128. For the Code Summarization task, each LSTM that encodes the AST paths had 128 units and the decoder LSTM had 320 units. For the Code Captioning task, to support the longer target sequences, each encoder LSTM had 256 units and the decoder was of size 512. (A configuration sketch follows the table.)
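
To make the project-level splits concrete, the sketch below shows one way a Java-med-style 800/100/100 split could be reproduced. It is a minimal sketch, not the paper's released pipeline: the directory layout ("java-med/" containing one subdirectory per project), the seed, and the helper name split_projects are assumptions.

```python
import random
from pathlib import Path

def split_projects(root, n_train, n_val, n_test, seed=42):
    """Split project directories at the project level, so that no file
    from a validation or test project ever appears in training."""
    projects = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    assert len(projects) >= n_train + n_val + n_test, "not enough projects"
    random.Random(seed).shuffle(projects)
    train = projects[:n_train]
    val = projects[n_train:n_train + n_val]
    test = projects[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Example: a Java-med-style 800/100/100 split (directory layout is assumed).
train, val, test = split_projects("java-med/", n_train=800, n_val=100, n_test=100)
```

Splitting by project rather than by file is the key design choice the paper reports: it prevents near-duplicate code from the same project leaking between training and evaluation.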
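
As a rough illustration of the reported experiment setup, the sketch below wires the Code Summarization hyperparameters into Keras: Glorot initialization, SGD with Nesterov momentum 0.95, an initial learning rate of 0.01 decayed by 0.95 per epoch, input dropout 0.25, a 128-unit path-encoder LSTM with recurrent dropout 0.5, and a 320-unit decoder. This is a minimal sketch under the assumption that a Keras reimplementation is acceptable; the authors' released TensorFlow code is the authoritative reference, and steps_per_epoch is a placeholder.

```python
import tensorflow as tf

steps_per_epoch = 10_000  # placeholder; depends on dataset size and batch size

# Initial learning rate 0.01, decayed by a factor of 0.95 once per epoch.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=steps_per_epoch,
    decay_rate=0.95,
    staircase=True,
)

# SGD with Nesterov momentum of 0.95; cross-entropy loss over target tokens.
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.95, nesterov=True)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Glorot (Xavier) initialization for the model parameters.
init = tf.keras.initializers.GlorotUniform()

# Code Summarization sizes: dropout 0.25 on input vectors, 128-unit path-encoder
# LSTM with recurrent dropout 0.5, and a 320-unit decoder LSTM.
input_dropout = tf.keras.layers.Dropout(0.25)
path_encoder = tf.keras.layers.LSTM(
    128, recurrent_dropout=0.5, kernel_initializer=init, recurrent_initializer=init)
decoder_cell = tf.keras.layers.LSTMCell(320, kernel_initializer=init)
```

For the Code Captioning task the same skeleton would use dropout 0.7, 256-unit encoder LSTMs, and a 512-unit decoder, per the reported setup.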