TreeCaps: Tree-Based Capsule Networks for Source Code Processing

Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated on a large number of Java and C/C++ programs, TreeCaps models outperform prior deep learning models of program source code, in terms of both accuracy and robustness for program comprehension tasks such as code functionality classification and function name prediction. Our empirical evaluation shows that TreeCaps achieves better classification accuracy and better F1 score in prediction compared to other code learning techniques such as Code2vec, Code2seq, ASTNN, TBCNN, GGNN, GREAT and GNN-FiLM. We have also applied three types of semantic-preserving transformations (Rabin et al. 2020; Zhang et al. 2020; Wang and Su 2019) that transform programs into syntactically different but semantically equivalent code to attack the models. Evaluations also show that our TreeCaps models are the most robust, able to preserve their predictions for transformed programs more than other learning techniques.
Researcher Affiliation | Collaboration | Nghi D. Q. Bui (1, 3), Yijun Yu (1, 2), Lingxiao Jiang (3); affiliations: (1) Trustworthy Open-Source Software Engineering Lab, Huawei Research Centre, Ireland; (2) School of Computing & Communications, The Open University, UK; (3) School of Computing & Information Systems, Singapore Management University
Pseudocode | Yes | Algorithm 1 Dynamic Routing; Algorithm 2 Variable-to-Static Capsule Routing. See the routing sketch after the table.
Open Source Code | Yes | Our implementation is publicly available at: https://github.com/bdqnghi/treecaps.
Open Datasets | Yes | The first Sorting Algorithms (SA) dataset is from Nghi, Yu, and Jiang (2019), which contains 10 algorithm classes of 1000 sorting programs written in Java. The second OJ dataset is from Mou et al. (2016), which contains 52000 C programs of 104 classes. We use the datasets from Code2seq (Alon et al. 2019a) containing three sets of Java programs: Java-Small (700k samples), Java-Med (4M samples), and Java-Large (16M samples).
Dataset Splits | Yes | We split each dataset into training, testing, and validation sets by the ratios of 70/20/10. These datasets have been split into training/testing/validation by projects. See the split sketch after the table.
Hardware Specification | Yes | To train the models, we use the Rectified Adam (RAdam) optimizer (Liu et al. 2019) with an initial learning rate of 0.001 subjected to decay on an Nvidia Tesla P100 GPU.
Software Dependencies | No | The paper mentions 'Tensorflow libraries' but does not specify a version number for TensorFlow or any other software dependencies, which is required for reproducibility.
Experiment Setup | Yes | For the parameters in our TBCNN layer, we follow Mou et al. (2016) to set the size of type embeddings to 128, the size of text embeddings to 128, and the number of convolutional steps m to 8. For the capsule layers, we set N_sc = 100, D_sc = 16, D_cc = 16 and routing iterations r = 3. We use Tensorflow libraries to implement TreeCaps. To train the models, we use the Rectified Adam (RAdam) optimizer (Liu et al. 2019) with an initial learning rate of 0.001 subjected to decay on an Nvidia Tesla P100 GPU. See the configuration sketch after the table.
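
The Pseudocode row lists Algorithm 1 (Dynamic Routing). For readers unfamiliar with capsule networks, the following is a minimal NumPy sketch of the standard routing-by-agreement loop that such an algorithm follows; the function names, tensor shapes, and toy sizes are illustrative assumptions, not the authors' TreeCaps implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-8):
    """Capsule non-linearity: preserves direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Routing-by-agreement between two capsule layers.

    u_hat: prediction vectors of shape (num_in, num_out, dim_out).
    Returns the output capsule vectors, shape (num_out, dim_out).
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                # routing logits
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                     # coupling coefficients per input capsule
        s = (c[..., None] * u_hat).sum(axis=0)     # weighted sum for each output capsule
        v = squash(s)                              # squashed output capsules
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # agreement update
    return v

# Toy usage: route 100 input capsules to 10 output capsules of dimension 16.
v = dynamic_routing(np.random.randn(100, 10, 16), num_iterations=3)
print(v.shape)  # (10, 16)
```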
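
The Dataset Splits row quotes a 70/20/10 train/test/validation split. A minimal sketch of such a per-sample random split is shown below; the function name and the fixed seed are assumptions. Note that, per the quoted text, the Code2seq Java datasets are instead split by project, so a random per-sample split like this would only apply to datasets such as SA and OJ.

```python
import random

def split_dataset(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle a list of samples and split it into train/test/validation
    sets by the given ratios (70/20/10 as quoted in the paper)."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    n = len(samples)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    train = samples[:n_train]
    test = samples[n_train:n_train + n_test]
    valid = samples[n_train + n_test:]
    return train, test, valid

train, test, valid = split_dataset(range(1000))
print(len(train), len(test), len(valid))  # 700 200 100
```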
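
The Experiment Setup row collects the reported hyperparameters. The sketch below gathers them into one configuration object; the dataclass and its field names are assumptions, and the RAdam wiring in the comment uses the TensorFlow Addons implementation, since the paper names only "Rectified Adam (RAdam)" without a library or version.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into one
# configuration object. The dataclass and field names are illustrative
# assumptions, not the authors' code.
from dataclasses import dataclass

@dataclass
class TreeCapsConfig:
    type_embedding_size: int = 128       # size of node-type embeddings
    text_embedding_size: int = 128       # size of node-text embeddings
    num_conv_steps: int = 8              # TBCNN convolutional steps m
    num_static_capsules: int = 100       # N_sc
    static_capsule_dim: int = 16         # D_sc
    code_capsule_dim: int = 16           # D_cc
    routing_iterations: int = 3          # r
    initial_learning_rate: float = 1e-3  # RAdam initial LR, decayed during training

config = TreeCapsConfig()

# Optimizer sketch: the paper does not name a specific RAdam library; the
# TensorFlow Addons implementation is one possible choice.
# import tensorflow_addons as tfa
# optimizer = tfa.optimizers.RectifiedAdam(learning_rate=config.initial_learning_rate)
```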