Dynamic Neural Program Embeddings for Program Repair
Authors: Ke Wang, Rishabh Singh, Zhendong Su
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate different syntactic and semantic program embeddings on the task of classifying the types of errors that students make in their submissions to an introductory programming class and on the Code Hunt education platform. Our evaluation results show that the semantic program embeddings significantly outperform the syntactic program embeddings based on token sequences and abstract syntax trees. In addition, we augment a search-based program repair system with predictions made from our semantic embedding and demonstrate significantly improved search efficiency. We train our dynamic program embeddings on the programming submissions obtained from Assignment 2 from Microsoft-DEV204.1X: Introduction to C# offered on edX and two other problems on the Microsoft Code Hunt platform. As shown in Table 3, our embeddings trained on execution traces significantly outperform those trained on program syntax (greater than 92% accuracy compared to less than 27% for syntax-based embeddings). |
| Researcher Affiliation | Collaboration | Ke Wang University of California Davis, CA 95616, USA kbwang@ucdavis.edu Rishabh Singh Microsoft Research Redmond, WA 98052, USA risin@microsoft.com Zhendong Su University of California Davis, CA 95616, USA su@ucdavis.edu |
| Pseudocode | Yes | Algorithm 1: SARFGEN's feedback generation procedure. Algorithm 2: Incorporate pre-trained model into SARFGEN's feedback generation procedure. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is open-source or provide a link to a code repository for it. |
| Open Datasets | No | We train our dynamic program embeddings on the programming submissions obtained from Assignment 2 from Microsoft-DEV204.1X: Introduction to C# offered on edX and two other problems on the Microsoft Code Hunt platform. The dataset for the experiments consists of the programming submissions made to the Module 2 assignment in Microsoft-DEV204.1X and two additional problems from the Microsoft Code Hunt platform. The paper describes the datasets used (e.g., from Microsoft-DEV204.1X and Code Hunt) and their sizes in Table 2, but does not provide specific links, DOIs, or citations with access information for public availability of these datasets. |
| Dataset Splits | Yes | Per problem, correct/incorrect program submissions and synthetic training/validation/testing samples: Print Chessboard — 2,281 / 742 submissions; 120K / 13K / 15K synthetic. Count Parentheses — 505 / 315 submissions; 20K / 2K / 2K synthetic. Generate Binary Digits — 518 / 371 submissions; 22K / 3K / 2K synthetic. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | All models are implemented in TensorFlow. All encoders in each of the trace models have two stacked GRU layers with 200 hidden units in each layer, except that the state encoder in the state trace model has a single layer of 100 hidden units. All networks are trained using the Adam optimizer (Kingma & Ba, 2014) with the learning and decay rates set to their default values (learning rate = 0.0001, beta1 = 0.9, beta2 = 0.999) and a mini-batch size of 500. The paper mentions software such as TensorFlow and the Adam optimizer but does not specify version numbers. |
| Experiment Setup | Yes | All encoders in each of the trace models have two stacked GRU layers with 200 hidden units in each layer, except that the state encoder in the state trace model has a single layer of 100 hidden units. We adopt random initialization for weight initialization. Our vocabulary has 5,568 unique tokens (i.e., the values of all variables at each time step), each of which is embedded into a 100-dimensional vector. All networks are trained using the Adam optimizer (Kingma & Ba, 2014) with the learning and decay rates set to their default values (learning rate = 0.0001, beta1 = 0.9, beta2 = 0.999) and a mini-batch size of 500. For the variable trace and dependency enforcement models, each trace is padded to have the same length across each batch; for the state trace model, both the number of variables in each program state and the length of the entire state trace are padded. |
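The reported hyperparameters and the per-batch trace padding described in the Experiment Setup row can be sketched as follows. This is a minimal plain-Python illustration, not the authors' TensorFlow implementation; the names `HPARAMS` and `pad_traces` are hypothetical:

```python
# Hyperparameters as reported in the paper's experiment setup.
HPARAMS = {
    "gru_layers": 2,          # stacked GRU layers per trace encoder
    "gru_hidden_units": 200,  # per layer (state encoder: one layer of 100)
    "embedding_dim": 100,     # each vocabulary token -> 100-dim vector
    "vocab_size": 5568,       # unique tokens (variable values per time step)
    "optimizer": "Adam",
    "learning_rate": 0.0001,
    "beta1": 0.9,
    "beta2": 0.999,
    "batch_size": 500,
}

def pad_traces(batch, pad_token=0):
    """Pad every trace in a batch to the length of the longest trace,
    mirroring the per-batch padding described for the variable trace
    and dependency enforcement models."""
    max_len = max(len(trace) for trace in batch)
    return [trace + [pad_token] * (max_len - len(trace)) for trace in batch]

# Example: three variable traces of unequal length become equal-length.
batch = [[3, 7, 7], [3], [3, 7]]
padded = pad_traces(batch)  # every trace now has length 3
```

For the state trace model the same idea is applied along two axes: the number of variables within each program state and the length of the overall state trace are both padded.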