Bimodal Modelling of Source Code and Natural Language

Authors: Miltos Allamanis, Daniel Tarlow, Andrew Gordon, Yi Wei

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | "We demonstrate their performance on two retrieval tasks: retrieving source code snippets given a natural language query, and retrieving natural language descriptions given a source code query (i.e., source code captioning). Experiments show there to be promise in this direction, and that modelling the structure of source code improves performance."
Researcher Affiliation | Collaboration | Miltiadis Allamanis (M.ALLAMANIS@ED.AC.UK), School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, United Kingdom; Daniel Tarlow (DTARLOW@MICROSOFT.COM), Andrew D. Gordon (ADG@MICROSOFT.COM), and Yi Wei (YIWE@MICROSOFT.COM), Microsoft Research, 21 Station Road, Cambridge, CB1 2FB, United Kingdom
Pseudocode | No | The paper describes the model generation process verbally but does not include a formal pseudocode block or algorithm.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the methodology described is publicly available.
Open Datasets | Yes | "We extract all questions and answers tagged with the C# tag and use the title of the question as the natural language query and the code snippets in the answers as the target source code. ... Stack Overflow data is freely available online through the Stack Exchange Data Explorer. ... We scraped the site [Dot Net Perls] for code snippets along with the natural language captions they are associated with."
Dataset Splits | Yes | "For each of the evaluation datasets, we create three distinct sets: the trainset that contains 70% of the code snippets, the test1 set that contains the same snippets as the trainset but novel natural language queries (if any), and the test2 set that contains the remaining 30% of the snippets with their associated natural language queries." (A sketch of this split protocol appears after the table.)
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU, CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions Roslyn (.NET Compiler Platform) for parsing C# code and AdaGrad for optimization, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "We then train (D = 20, 100 iterations) and evaluate the log-bilinear models on the synthetic data. ... For optimization, we use AdaGrad (Duchi et al., 2011). We initialize the biases b_{i,v} to the noise PCFG distribution such that b_{i,v} = log P_noise(v | i). The rest of the representations are initialized randomly around a central number with some small additive noise. The l_i components are initialized with center 0, the c_{φ_i} components centered at 1 when using the multiplicative model or centered at 0 for the additive model, and the diagonals of H_i at 1. ... D = 50 for all models." (An initialization and optimizer sketch appears after the table.)
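
The split protocol quoted in the Dataset Splits row is mechanical enough to sketch in code. Below is a minimal Python sketch, assuming each example is a (natural language query, code snippet) pair and that membership in the 70/30 split is decided per snippet; the function name and the handling of duplicate queries are illustrative, not from the paper.

```python
import random

def make_splits(pairs, train_frac=0.7, seed=0):
    """Split (query, snippet) pairs into the three sets described in the
    paper: trainset (70% of snippets, one query each), test1 (the same
    snippets paired with their remaining, novel queries, if any), and
    test2 (the other 30% of snippets with their associated queries)."""
    snippets = sorted({snippet for _, snippet in pairs})
    rng = random.Random(seed)
    rng.shuffle(snippets)
    train_snippets = set(snippets[:int(train_frac * len(snippets))])

    trainset, test1, test2 = [], [], []
    seen = set()
    for query, snippet in pairs:
        if snippet in train_snippets:
            if snippet in seen:
                test1.append((query, snippet))  # novel query for a train snippet
            else:
                trainset.append((query, snippet))
                seen.add(snippet)
        else:
            test2.append((query, snippet))
    return trainset, test1, test2
```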
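
Similarly, the Experiment Setup row pins down an initialization scheme and optimizer. The sketch below shows one plausible reading of that quote; the parameter names, array shapes, and learning rate are assumptions (the paper publishes no code), and only the initialization centers and the use of AdaGrad come from the quoted text.

```python
import numpy as np

def init_logbilinear_params(log_p_noise, D=50, scale=0.01,
                            multiplicative=True, seed=0):
    """Initialize log-bilinear model parameters per the quoted setup.
    `log_p_noise` holds log P_noise(v | i), indexed [i, v], and seeds
    the biases b_{i,v}; all other names and shapes are assumptions."""
    rng = np.random.default_rng(seed)
    n = log_p_noise.shape[0]                 # number of contexts i
    c_center = 1.0 if multiplicative else 0.0
    return {
        "b": log_p_noise.copy(),                              # b_{i,v} = log P_noise(v | i)
        "l": scale * rng.standard_normal((n, D)),             # l_i: center 0, small noise
        "c": c_center + scale * rng.standard_normal((n, D)),  # c_{phi_i}: center 1 (mult.) or 0 (add.)
        "H": np.ones((n, D)),                                 # diagonals of H_i at 1
    }

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update (Duchi et al., 2011): each coordinate's step
    is scaled by the inverse root of its accumulated squared gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```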