Structured Generative Models of Natural Source Code
Authors: Chris J. Maddison, Daniel Tarlow
ICML 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the problem of building generative models of natural source code (NSC); that is, source code written by humans and meant to be understood by humans. Our primary contribution is to describe new generative models that are tailored to NSC. The models are based on probabilistic context free grammars (PCFGs) and neuro-probabilistic language models (Mnih & Teh, 2012), which are extended to incorporate additional source code-specific structure. These models can be efficiently trained on a corpus of source code and outperform a variety of less structured baselines in terms of predictive log likelihoods on held-out data. In all experiments, we used a dataset that we collected from TopCoder.com. |
| Researcher Affiliation | Collaboration | Chris J. Maddison (cmaddis@cs.toronto.edu), University of Toronto; Daniel Tarlow (dtarlow@microsoft.com), Microsoft Research |
| Pseudocode | Yes | Algorithm 1: Sampling from LTTs (Log-bilinear Tree-Traversal models). A sketch of top-down tree sampling in this spirit is given after the table. |
| Open Source Code | No | The paper mentions providing 'samples of full source code files' in the Supplementary Material, but it does not state that the source code for the described methodology itself is open-source or publicly available. |
| Open Datasets | No | In all experiments, we used a dataset that we collected from TopCoder.com. The paper states that the authors collected the dataset but does not provide a direct link, DOI, or a formal citation with author/year for accessing this specific collected dataset. |
| Dataset Splits | Yes | The overall split proportions are 20% test, 10% validation, and 70% train; a minimal split sketch is given after the table. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions the Roslyn C# compiler and AdaGrad but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | All experiments use a validation set to choose hyperparameter values. These include the strength of a smoothing parameter and the epoch at which to stop training (if applicable). For the gradient-based optimization, we used AdaGrad (Duchi et al., 2011) with stochastic minibatches. Unless otherwise specified, the dimension of the latent representation vectors was set to 50. A minimal AdaGrad sketch is given after the table. |
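
The "Pseudocode" row refers to the paper's Algorithm 1 for sampling from LTTs. The paper does not release code, so the following is only a minimal sketch of top-down ancestral sampling from a PCFG-like grammar, the backbone that LTTs extend with latent traversal variables and a log-bilinear production model; the toy grammar, symbols, and probabilities are assumptions made for illustration.

```python
import random

# Toy PCFG: nonterminal -> list of (probability, right-hand side) pairs.
# All symbols and probabilities here are illustrative, not from the paper.
GRAMMAR = {
    "Stmt": [(0.6, ["Expr", ";"]), (0.4, ["if", "(", "Expr", ")", "Stmt"])],
    "Expr": [(0.5, ["id"]), (0.3, ["Expr", "+", "Expr"]), (0.2, ["lit"])],
}

def sample(symbol, depth=0, max_depth=10):
    """Expand `symbol` top-down, sampling one production per nonterminal."""
    if symbol not in GRAMMAR or depth > max_depth:
        return [symbol]                       # terminal (or depth-truncated) symbol
    probs, rhss = zip(*GRAMMAR[symbol])
    rhs = random.choices(rhss, weights=probs, k=1)[0]
    tokens = []
    for child in rhs:                         # left-to-right, depth-first traversal
        tokens.extend(sample(child, depth + 1, max_depth))
    return tokens

print(" ".join(sample("Stmt")))
```

In the paper's LTT models, the production distribution is additionally conditioned on latent traversal state through a log-bilinear model rather than being fixed per nonterminal as in this plain PCFG sketch.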
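
The "Dataset Splits" row reports 70%/10%/20% train/validation/test proportions. Below is a minimal sketch of such a file-level split; the corpus size, file names, and seed are placeholders, since the TopCoder.com data itself is not released.

```python
import random

# Placeholder corpus of source files; the real data was collected from TopCoder.com.
files = [f"solution_{i}.cs" for i in range(1000)]
random.Random(0).shuffle(files)               # fixed seed only to make the sketch repeatable

n = len(files)
n_train, n_val = int(0.7 * n), int(0.1 * n)
train = files[:n_train]                       # 70% train
val = files[n_train:n_train + n_val]          # 10% validation
test = files[n_train + n_val:]                # remaining ~20% test
```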
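
The "Experiment Setup" row names AdaGrad (Duchi et al., 2011) with stochastic minibatches and 50-dimensional latent vectors. The sketch below shows only the generic AdaGrad update on a toy least-squares objective; the learning rate, epsilon, batch size, and objective are assumptions, not the paper's settings.

```python
import numpy as np

def adagrad_step(params, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-coordinate step scaled by accumulated squared gradients."""
    accum += grad ** 2
    params -= lr * grad / (np.sqrt(accum) + eps)
    return params, accum

# Toy least-squares problem with a 50-dimensional parameter vector
# (echoing the paper's 50-dimensional latent representations).
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(512, 50)), rng.normal(size=50)
y = X @ w_true
w, accum = np.zeros(50), np.zeros(50)

for epoch in range(20):
    for start in range(0, len(X), 64):        # stochastic minibatches of size 64
        xb, yb = X[start:start + 64], y[start:start + 64]
        grad = 2.0 / len(xb) * xb.T @ (xb @ w - yb)
        w, accum = adagrad_step(w, grad, accum)
```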