CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation
Authors: Pardis Pashakhanloo, Aaditya Naik, Yuepeng Wang, Hanjun Dai, Petros Maniatis, Mayur Naik
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CODETREK on four diverse and challenging Python tasks: variable misuse, exception prediction, unused definition, and variable shadowing. CODETREK achieves an accuracy of 91%, 63%, 98%, and 94% on these tasks respectively, and outperforms state-of-the-art neural models by 2-19 percentage points. |
| Researcher Affiliation | Collaboration | Pardis Pashakhanloo (University of Pennsylvania), Aaditya Naik (University of Pennsylvania), Yuepeng Wang (Simon Fraser University), Hanjun Dai (Google Research), Petros Maniatis (Google Research), Mayur Naik (University of Pennsylvania) |
| Pseudocode | Yes | Algorithm 1 (Code2Rel): given a program P, a set of base relation names R_B, and a set of derived relation names R_Q, construct and return a database D. Algorithm 2 (Rel2Graph): given a database D, construct a program graph G. Algorithm 3 (Graph2Walks): given a program graph G, a walk specification S = ⟨C, B, min, max⟩, and the number of walks w, sample a set of walks W. Algorithm 4 (Code2Walks): given a program P and a task specification T = ⟨R_B, R_Q, S, n⟩, generate a set of walks W. (A walk-sampling sketch in Python follows the table.) |
| Open Source Code | Yes | CODETREK is publicly available at https://github.com/ppashakhanloo/CodeTrek. |
| Open Datasets | Yes | We use the ETH Py150 Open corpus consisting of 125K Python modules (https://github.com/google-research-datasets/eth_py150_open). |
| Dataset Splits | Yes | Table 9: The number of samples used for training, validation, and testing and the lines of code that they contain. |
| Hardware Specification | No | The paper mentions '8 GPUs for distributed synchronized SGD training' but does not specify the model or type of these GPUs or any other specific hardware components. |
| Software Dependencies | No | The paper lists the frameworks and models used (e.g., CuBERT, GREAT, Code2Seq, GGNN) and mentions the 'tensor2tensor package', but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | In this section, we describe details on the parameters and hyperparameters we used. We train CODETREK models with a learning rate of 10^-4, 4 transformer layers, an embedding size of 256, 8 attention heads, and 512 hidden units. We sample 100 walks with lengths of up to 24 in each graph for every task, except for the VARMISUSE-FUN task, for which we sample 500 such walks per graph. (A hedged configuration sketch follows the table.) |
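To make the Graph2Walks step (Algorithm 3) concrete, below is a minimal Python sketch of length-bounded random-walk sampling over a program graph. The graph encoding (a dict of labeled adjacency lists), the `sample_walks` helper, and the anchor-selection logic are illustrative assumptions; the paper's walk specification S = ⟨C, B, min, max⟩ additionally constrains walks via C and biases edge choices via B, which this sketch omits.

```python
import random

def sample_walks(graph, anchors, num_walks, min_len, max_len, seed=0):
    """Sample `num_walks` random walks of min_len..max_len nodes, starting at anchors.

    graph: dict mapping node -> list of (edge_label, neighbor) pairs.
    Assumes every anchor can reach a walk of at least `min_len` nodes.
    """
    rng = random.Random(seed)
    walks = []
    while len(walks) < num_walks:
        walk = [rng.choice(anchors)]           # start each walk at an anchor node
        target_len = rng.randint(min_len, max_len)
        while len(walk) < target_len:
            neighbors = graph.get(walk[-1], [])
            if not neighbors:
                break                          # dead end: stop extending this walk
            _edge, nxt = rng.choice(neighbors)  # unbiased choice; the paper uses a bias B
            walk.append(nxt)
        if len(walk) >= min_len:               # discard walks cut short below min_len
            walks.append(walk)
    return walks

# Toy usage: a three-node graph with labeled edges.
g = {"def:x": [("use", "read:x")], "read:x": [("next", "call:f")], "call:f": []}
print(sample_walks(g, anchors=["def:x"], num_walks=3, min_len=2, max_len=3))
```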
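Likewise, the reported hyperparameters map naturally onto a small transformer encoder over sampled walks. The following PyTorch sketch wires up the stated values (4 layers, embedding size 256, 8 attention heads, 512 hidden units, learning rate 10^-4); the vocabulary size, mean pooling, and binary classification head are hypothetical stand-ins rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000  # hypothetical vocabulary over walk tokens

class WalkEncoder(nn.Module):
    """Transformer over sampled walks, sized per the reported hyperparameters."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 256)       # embedding size 256
        layer = nn.TransformerEncoderLayer(
            d_model=256, nhead=8, dim_feedforward=512,   # 8 heads, 512 hidden units
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # 4 layers
        self.classify = nn.Linear(256, 2)  # hypothetical head (e.g., bug vs. no bug)

    def forward(self, walk_tokens):          # (batch, walk_len) token ids
        h = self.encoder(self.embed(walk_tokens))
        return self.classify(h.mean(dim=1))  # mean-pool over walk positions

model = WalkEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate 10^-4
```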