CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation

Authors: Pardis Pashakhanloo, Aaditya Naik, Yuepeng Wang, Hanjun Dai, Petros Maniatis, Mayur Naik

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CODETREK on four diverse and challenging Python tasks: variable misuse, exception prediction, unused definition, and variable shadowing. CODETREK achieves an accuracy of 91%, 63%, 98%, and 94% on these tasks respectively, and outperforms state-of-the-art neural models by 2-19 percentage points.
Researcher Affiliation | Collaboration | Pardis Pashakhanloo (University of Pennsylvania); Aaditya Naik (University of Pennsylvania); Yuepeng Wang (Simon Fraser University); Hanjun Dai (Google Research); Petros Maniatis (Google Research); Mayur Naik (University of Pennsylvania)
Pseudocode | Yes | Algorithm 1 (Code2Rel): given a program P, a set of base relation names RB, and a set of derived relation names RQ, construct and return a database D. Algorithm 2 (Rel2Graph): given a database D, construct a program graph G. Algorithm 3 (Graph2Walks): given a program graph G, a walk specification S = ⟨C, B, min, max⟩, and the number of walks w, sample a set of walks W. Algorithm 4 (Code2Walks): given a program P and a task specification T = ⟨RB, RQ, S, n⟩, generate a set of walks W. (A hedged walk-sampling sketch is given after this table.)
Open Source Code | Yes | CODETREK is publicly available at https://github.com/ppashakhanloo/CodeTrek.
Open Datasets | Yes | We use the ETH Py150 Open corpus consisting of 125K Python modules (https://github.com/google-research-datasets/eth_py150_open).
Dataset Splits | Yes | Table 9: The number of samples used for training, validation, and testing and the lines of code that they contain.
Hardware Specification | No | The paper mentions '8 GPUs for distributed synchronized SGD training' but does not specify the model or type of these GPUs or any other specific hardware components.
Software Dependencies | No | The paper lists frameworks and models used (e.g., CuBERT, GREAT, Code2Seq, GGNN) and mentions the 'tensor2tensor package', but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | In this section, we describe details of the parameters and hyperparameters we used. CODETREK: We train CODETREK models with a learning rate of 10^-4, 4 transformer layers, an embedding size of 256, 8 attention heads, and 512 hidden units. We sample 100 walks with lengths of up to 24 in each graph for every task, except for the VARMISUSE-FUN task, for which we sample 500 such walks per graph. (These values are collected into the config sketch after this table.)
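
Illustrative sketch for the Pseudocode row (Algorithm 3, Graph2Walks): the Python snippet below samples biased random walks from a toy program graph. The graph encoding, the bias-by-edge-label weighting, and all identifiers (graph2walks, toy_graph, and the fields standing in for the specification components C, B, min, and max) are assumptions made for illustration; this is a minimal sketch of the idea, not the authors' implementation.

    import random
    from typing import Dict, List, Tuple

    # Hypothetical program graph: node name -> list of (neighbor, edge_label) pairs.
    Graph = Dict[str, List[Tuple[str, str]]]

    def graph2walks(graph: Graph, anchor: str, bias: Dict[str, float],
                    min_len: int, max_len: int, num_walks: int) -> List[List[str]]:
        """Sample num_walks biased random walks starting at the anchor node.

        `bias` maps edge labels to sampling weights, standing in for the walk
        specification's bias component B; a simplification, not Algorithm 3 itself.
        """
        walks = []
        for _ in range(num_walks):
            node, walk = anchor, [anchor]
            target_len = random.randint(min_len, max_len)
            while len(walk) < target_len:
                neighbors = graph.get(node, [])
                if not neighbors:
                    break  # dead end: stop this walk early
                weights = [bias.get(label, 1.0) for _, label in neighbors]
                node = random.choices([n for n, _ in neighbors], weights=weights, k=1)[0]
                walk.append(node)
            walks.append(walk)
        return walks

    # Toy usage: a three-node graph with labeled edges.
    toy_graph = {
        "var_x": [("assign_1", "defined_by")],
        "assign_1": [("var_x", "defines"), ("expr_2", "uses")],
        "expr_2": [("assign_1", "used_in")],
    }
    print(graph2walks(toy_graph, "var_x", {"defined_by": 2.0},
                      min_len=2, max_len=4, num_walks=3))

The paper further serializes sampled walks into sequences consumed by a Transformer encoder; that step is not shown here.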
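
Illustrative sketch for the Experiment Setup row: the quoted hyperparameters gathered into a single configuration dictionary. The key names are hypothetical and chosen for readability; they are not the configuration keys used by the CODETREK repository.

    # Hyperparameters quoted in the Experiment Setup row above, collected into a
    # hypothetical config dict; key names are illustrative, not the repository's.
    codetrek_config = {
        "learning_rate": 1e-4,           # "learning rate of 10^-4"
        "num_transformer_layers": 4,     # "4 transformer layers"
        "embedding_size": 256,           # "an embedding size of 256"
        "num_attention_heads": 8,        # "8 attention heads"
        "hidden_units": 512,             # "512 hidden units"
        "walks_per_graph": 100,          # 500 for the VARMISUSE-FUN task
        "max_walk_length": 24,           # "lengths of up to 24"
    }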