Learning to Align the Source Code to the Compiled Object Code

Authors: Dor Levy, Lior Wolf

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments include short C functions, both artificial and human-written, and show that our neural network architecture is able to predict the alignment with high accuracy, outperforming known baselines." and "Our experiments show that the neural network presented is able to predict the alignment considerably more accurately than the literature baselines."
Researcher Affiliation | Collaboration | "The School of Computer Science, Tel Aviv University" and "Facebook AI Research"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code and data are publicly available at: https://github.com/DorLevyML/learn-align"
Open Datasets | Yes | "Our code and data are publicly available at: https://github.com/DorLevyML/learn-align" and "For the real-world human-written data set, we used over 53,000 short functions from 90 open-source projects, that are part of the GNU project and are written in C. Among them are grep, nano, etc."
Dataset Splits | Yes | "The training set of synthetic functions contains 120,000 samples. The validation and the test sets contain 15,000 samples each. The training, validation and test sets of human-written functions contain 42,391, 5,474 and 5,253 samples, respectively."
Hardware Specification | No | "One reason is the computational efficiency of CNNs compared to RNNs, which leads to faster computations both on GPU and CPU."
Software Dependencies | No | "For example, the GCC compiler (Stallman et al., 2009) is used.", "In order to generate random C functions, we used pyfuzz, an open-source random program generator for python (Myint, 2013)", and "The Adam learning rate scheme (Kingma & Ba, 2015) is used, with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 1e-08."
Experiment Setup | Yes | "The length of all functions has been limited to 450 tokens. The training set of synthetic functions contains 120,000 samples. The validation and the test sets contain 15,000 samples each. The training, validation and test sets of human-written functions contain 42,391, 5,474 and 5,253 samples, respectively. During training, we use batches of 32 samples each. The weights of the LSTM and attention networks are initialized uniformly in [−1.0, 1.0]. The CNN filter weights are initialized using a truncated normal distribution with a standard deviation of 0.1. The biases of the LSTM and CNN networks are initialized to 0.0, except for the biases of the LSTM forget gates, which are initialized to 1.0 in order to encourage memorization at the beginning of training (Jozefowicz et al., 2015). The Adam learning rate scheme (Kingma & Ba, 2015) is used, with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 1e-08."
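
The Experiment Setup row above fully specifies the initialization scheme and optimizer hyperparameters. The following is a minimal sketch of that configuration, assuming a PyTorch-style implementation; the module names ("encoder", "conv") and layer sizes are placeholders for illustration only and are not taken from the paper or the released code.

import torch
import torch.nn as nn

def init_weights(module):
    # Initialization scheme quoted in the Experiment Setup row.
    if isinstance(module, nn.LSTM):
        for name, param in module.named_parameters():
            if "weight" in name:
                # LSTM (and, per the paper, attention) weights: uniform in [-1.0, 1.0].
                nn.init.uniform_(param, -1.0, 1.0)
            elif "bias" in name:
                # Biases start at 0.0; the forget-gate slice is set to 1.0 to
                # encourage memorization early in training (Jozefowicz et al., 2015).
                nn.init.zeros_(param)
                h = module.hidden_size
                param.data[h:2 * h] = 1.0
    elif isinstance(module, nn.Conv1d):
        # CNN filter weights: truncated normal with std 0.1; CNN biases: 0.0.
        nn.init.trunc_normal_(module.weight, std=0.1)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Placeholder network; the actual alignment architecture from the paper is not reproduced here.
model = nn.ModuleDict({
    "encoder": nn.LSTM(input_size=128, hidden_size=128, batch_first=True),
    "conv": nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3),
})
model.apply(init_weights)

# Adam with the quoted hyperparameters; training uses batches of 32 samples.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08)
batch_size = 32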