Learning to Align the Source Code to the Compiled Object Code

Authors: Dor Levy, Lior Wolf

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments include short C functions, both artificial and human-written, and show that our neural network architecture is able to predict the alignment with high accuracy, outperforming known baselines." and "Our experiments show that the neural network presented is able to predict the alignment considerably more accurately than the literature baselines."
Researcher Affiliation | Collaboration | "The School of Computer Science, Tel Aviv University" and "Facebook AI Research"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code and data are publicly available at: https://github.com/DorLevyML/learn-align"
Open Datasets | Yes | "Our code and data are publicly available at: https://github.com/DorLevyML/learn-align" and "For the real-world human-written data set, we used over 53,000 short functions from 90 open-source projects, that are part of the GNU project and are written in C. Among them are grep, nano, etc."
Dataset Splits | Yes | "The training set of synthetic functions contains 120,000 samples. The validation and the test sets contain 15,000 samples each. The training, validation and test sets of human-written functions contain 42,391, 5,474 and 5,253 samples, respectively."
Hardware Specification | No | "One reason is the computational efficiency of CNNs compared to RNNs, which leads to faster computations both on GPU and CPU."
Software Dependencies | No | "For example, the GCC compiler (Stallman et al., 2009) is used.", "In order to generate random C functions, we used pyfuzz, an open-source random program generator for python (Myint, 2013)", and "The Adam learning rate scheme (Kingma & Ba, 2015) is used, with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 1e-08."
Experiment Setup | Yes | "The length of all functions has been limited to 450 tokens. The training set of synthetic functions contains 120,000 samples. The validation and the test sets contain 15,000 samples each. The training, validation and test sets of human-written functions contain 42,391, 5,474 and 5,253 samples, respectively. During training, we use batches of 32 samples each. The weights of the LSTM and attention networks are initialized uniformly in [−1.0, 1.0]. The CNN filter weights are initialized using a truncated normal distribution with a standard deviation of 0.1. The biases of the LSTM and CNN networks are initialized to 0.0, except for the biases of the LSTM forget gates, which are initialized to 1.0 in order to encourage memorization at the beginning of training (Jozefowicz et al., 2015). The Adam learning rate scheme (Kingma & Ba, 2015) is used, with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 1e-08."
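
The Experiment Setup row above fully specifies the initialization scheme and optimizer hyperparameters. The following is a minimal sketch of that configuration, assuming a PyTorch-style implementation; the module names ("encoder", "conv") and layer sizes are placeholders for illustration only and are not taken from the paper or the released code.

import torch
import torch.nn as nn

def init_weights(module):
    # Initialization scheme quoted in the Experiment Setup row.
    if isinstance(module, nn.LSTM):
        for name, param in module.named_parameters():
            if "weight" in name:
                # LSTM (and, per the paper, attention) weights: uniform in [-1.0, 1.0].
                nn.init.uniform_(param, -1.0, 1.0)
            elif "bias" in name:
                # Biases start at 0.0; the forget-gate slice is set to 1.0 to
                # encourage memorization early in training (Jozefowicz et al., 2015).
                nn.init.zeros_(param)
                h = module.hidden_size
                param.data[h:2 * h] = 1.0
    elif isinstance(module, nn.Conv1d):
        # CNN filter weights: truncated normal with std 0.1; CNN biases: 0.0.
        nn.init.trunc_normal_(module.weight, std=0.1)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Placeholder network; the actual alignment architecture from the paper is not reproduced here.
model = nn.ModuleDict({
    "encoder": nn.LSTM(input_size=128, hidden_size=128, batch_first=True),
    "conv": nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3),
})
model.apply(init_weights)

# Adam with the quoted hyperparameters; training uses batches of 32 samples.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08)
batch_size = 32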