Structured Generative Models of Natural Source Code

Authors: Chris Maddison, Daniel Tarlow

ICML 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the problem of building generative models of natural source code (NSC); that is, source code written by humans and meant to be understood by humans. Our primary contribution is to describe new generative models that are tailored to NSC. The models are based on probabilistic context free grammars (PCFGs) and neuro-probabilistic language models (Mnih & Teh, 2012), which are extended to incorporate additional source code-specific structure. These models can be efficiently trained on a corpus of source code and outperform a variety of less structured baselines in terms of predictive log likelihoods on held-out data. In all experiments, we used a dataset that we collected from TopCoder.com. (See the PCFG sketch after the table.)
Researcher Affiliation | Collaboration | Chris J. Maddison (CMADDIS@CS.TORONTO.EDU), University of Toronto; Daniel Tarlow (DTARLOW@MICROSOFT.COM), Microsoft Research
Pseudocode | Yes | Algorithm 1: Sampling from LTTs.
Open Source Code | No | The paper mentions providing 'samples of full source code files' in the Supplementary Material, but it does not state that the source code for the described methodology itself is open-source or publicly available.
Open Datasets | No | In all experiments, we used a dataset that we collected from TopCoder.com. The paper states they collected the dataset but does not provide a direct link, DOI, or a formal citation with author/year for accessing this specific collected dataset.
Dataset Splits | Yes | The overall split proportions are 20% test, 10% validation, and 70% train. (See the splitting sketch after the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions the 'Roslyn C# compiler' and 'AdaGrad' but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | All experiments use a validation set to choose hyperparameter values. These include the strength of a smoothing parameter and the epoch at which to stop training (if applicable). For the gradient-based optimization, we used AdaGrad (Duchi et al., 2011) with stochastic minibatches. Unless otherwise specified, the dimension of the latent representation vectors was set to 50. (See the AdaGrad sketch after the table.)
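
The Research Type row quotes the paper's high-level approach: PCFG-based generative models over parse trees, evaluated by predictive log-likelihood on held-out code. The sketch below is only a rough illustration of that evaluation for a plain PCFG baseline, not the paper's LTT models (which add neuro-probabilistic components and code-specific structure); the tree encoding and the back-off constant for unseen rules are assumptions made for illustration.

```python
import math
from collections import Counter

def train_pcfg(trees):
    """Estimate maximum-likelihood PCFG rule log-probabilities from parse trees.

    A tree is a nested tuple (nonterminal, [children]); leaves are plain tokens.
    Returns {(parent, child_symbols): log P(rule | parent)}.
    """
    rule_counts = Counter()
    parent_counts = Counter()

    def visit(node):
        if not isinstance(node, tuple):      # terminal token
            return node
        parent, children = node
        child_symbols = tuple(visit(c) for c in children)
        rule_counts[(parent, child_symbols)] += 1
        parent_counts[parent] += 1
        return parent

    for tree in trees:
        visit(tree)
    return {rule: math.log(count / parent_counts[rule[0]])
            for rule, count in rule_counts.items()}

def tree_log_likelihood(tree, rule_log_probs, unseen_log_prob=-20.0):
    """Predictive log-likelihood of a held-out tree under the estimated PCFG
    (crude constant back-off for rules never seen in training)."""
    total = 0.0

    def visit(node):
        nonlocal total
        if not isinstance(node, tuple):
            return node
        parent, children = node
        child_symbols = tuple(visit(c) for c in children)
        total += rule_log_probs.get((parent, child_symbols), unseen_log_prob)
        return parent

    visit(tree)
    return total

# Toy trees standing in for compiler-produced parse trees:
train_trees = [("Expr", [("Name", ["x"]), "+", ("Name", ["y"])])]
model = train_pcfg(train_trees)
print(tree_log_likelihood(("Expr", [("Name", ["x"]), "+", ("Name", ["y"])]), model))
```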
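
The Dataset Splits row reports proportions of 70% train, 10% validation, and 20% test. The paper does not describe the exact splitting procedure, so the following is only a minimal sketch that reproduces those proportions over a list of collected source files; the shuffling, seed, and per-file granularity are assumptions.

```python
import random

def split_dataset(files, seed=0):
    """Split a list of source files into 70% train / 10% validation / 20% test,
    matching the proportions reported in the paper (procedure assumed, not stated)."""
    files = list(files)
    random.Random(seed).shuffle(files)
    n_train = int(0.7 * len(files))
    n_valid = int(0.1 * len(files))
    return (files[:n_train],
            files[n_train:n_train + n_valid],
            files[n_train + n_valid:])

train, valid, test = split_dataset([f"prog_{i}.cs" for i in range(100)])
print(len(train), len(valid), len(test))  # 70 10 20
```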
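
The Experiment Setup row mentions AdaGrad (Duchi et al., 2011) with stochastic minibatches, validation-based selection of a smoothing strength and stopping epoch, and 50-dimensional latent representation vectors. Below is a generic sketch of that optimization loop under those stated choices; the learning rate, batch size, gradient function, and parameter shapes are placeholders, not values reported in the paper.

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-parameter step sizes shrink with the
    accumulated squared gradient (Duchi et al., 2011)."""
    for name, g in grads.items():
        accum[name] += g * g
        params[name] -= lr * g / (np.sqrt(accum[name]) + eps)
    return params, accum

def train(params, minibatches, grad_fn, valid_ll_fn, max_epochs=50, lr=0.1):
    """Minibatch AdaGrad training with validation-based choice of the stopping
    epoch, mirroring the quoted setup (loop details assumed)."""
    accum = {name: np.zeros_like(value) for name, value in params.items()}
    best_ll = -np.inf
    best_params = {name: value.copy() for name, value in params.items()}
    for epoch in range(max_epochs):
        for batch in minibatches:
            grads = grad_fn(params, batch)
            params, accum = adagrad_step(params, grads, accum, lr=lr)
        ll = valid_ll_fn(params)          # held-out predictive log-likelihood
        if ll > best_ll:
            best_ll = ll
            best_params = {name: value.copy() for name, value in params.items()}
    return best_params

# Latent representation vectors of dimension 50, as stated in the paper
# (the number of vectors, 1000 here, is a placeholder):
params = {"latent": np.zeros((1000, 50))}
```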