Learning to Align the Source Code to the Compiled Object Code
Authors: Dor Levy, Lior Wolf
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments include short C functions, both artificial and human-written, and show that our neural network architecture is able to predict the alignment with high accuracy, outperforming known baselines. Our experiments show that the neural network presented is able to predict the alignment considerably more accurately than the literature baselines. |
| Researcher Affiliation | Collaboration | The School of Computer Science, Tel Aviv University; Facebook AI Research. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are publicly available at: https://github.com/DorLevyML/learn-align |
| Open Datasets | Yes | Our code and data are publicly available at: https://github.com/DorLevyML/learn-align and For the real-world human-written data set, we used over 53,000 short functions from 90 open-source projects that are part of the GNU project and are written in C. Among them are grep, nano, etc. |
| Dataset Splits | Yes | The training set of synthetic functions contains 120,000 samples. The validation and the test sets contain 15,000 samples each. The training, validation and test sets of human-written functions contain 42,391, 5,474 and 5,253 samples, respectively. |
| Hardware Specification | No | One reason is the computational efficiency of CNNs compared to RNNs, which leads to faster computations both on GPU and CPU. |
| Software Dependencies | No | For example, the GCC compiler (Stallman et al., 2009) is used. In order to generate random C functions, we used pyfuzz, an open-source random program generator for Python (Myint, 2013) and The Adam learning rate scheme (Kingma & Ba, 2015) is used, with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 1e-08. |
| Experiment Setup | Yes | The length of all functions has been limited to 450 tokens. The training set of synthetic functions contains 120,000 samples. The validation and the test sets contain 15,000 samples each. The training, validation and test sets of human-written functions contain 42,391, 5,474 and 5,253 samples, respectively. During training, we use batches of 32 samples each. The weights of the LSTM and attention networks are initialized uniformly in [-1.0, 1.0]. The CNN filter weights are initialized using truncated normal distribution with a standard deviation of 0.1. The biases of the LSTM and CNN networks are initialized to 0.0, except for the biases of the LSTM forget gates, which are initialized to 1.0 in order to encourage memorization at the beginning of training (Józefowicz et al., 2015). The Adam learning rate scheme (Kingma & Ba, 2015) is used, with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 1e-08. |
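
The Experiment Setup row pins down concrete initialization and optimizer settings. As an illustration only, the sketch below shows how that configuration could be expressed in PyTorch; the paper's actual framework is not stated in this report, and `TinyAlignNet`, its sub-module names, and the parameter-name matching are hypothetical stand-ins, not the authors' code.

```python
import torch
import torch.nn as nn


class TinyAlignNet(nn.Module):
    """Hypothetical stand-in with LSTM, CNN, and attention sub-modules."""
    def __init__(self, vocab_size=128, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.cnn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.attention = nn.Linear(dim, dim)


def init_weights(model: nn.Module) -> None:
    """Apply the initialization quoted in the Experiment Setup row."""
    for name, param in model.named_parameters():
        if "lstm" in name or "attention" in name:
            if "weight" in name:
                nn.init.uniform_(param, -1.0, 1.0)    # uniform in [-1.0, 1.0]
            elif "bias" in name:
                nn.init.zeros_(param)                 # biases start at 0.0
                if "lstm" in name:
                    # PyTorch LSTM bias layout is [input, forget, cell, output];
                    # set the forget-gate slice to 1.0 to encourage memorization early on
                    hidden = param.numel() // 4
                    param.data[hidden:2 * hidden].fill_(1.0)
        elif "cnn" in name:
            if "weight" in name:
                nn.init.trunc_normal_(param, std=0.1)  # truncated normal, std 0.1
            elif "bias" in name:
                nn.init.zeros_(param)


model = TinyAlignNet()
init_weights(model)

# Adam with the quoted hyperparameters; training uses batches of 32 samples.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)
batch_size = 32
```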