Exploiting Code Symmetries for Learning Program Semantics
Authors: Kexin Pei, Weichen Li, Qirui Jin, Shuyang Liu, Scott Geng, Lorenzo Cavallaro, Junfeng Yang, Suman Jana
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our solution, SYMC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. SYMC obtains superior performance on five program analysis tasks, outperforming state-of-the-art code models, including GPT-4, without any pre-training. Our results suggest that code LLMs that encode the code structural prior via the code symmetry group generalize better and faster. (A minimal equivariance sketch follows the table.) |
| Researcher Affiliation | Academia | 1 Columbia University, 2 The University of Chicago, 3 University of Michigan, 4 Huazhong University of Science and Technology, 5 University of Washington, 6 University College London. Correspondence to: Kexin Pei <kpei@cs.uchicago.edu>, Suman Jana <suman@cs.columbia.edu>. |
| Pseudocode | No | The paper describes algorithms and operations using text and mathematical notation, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the SYMC methodology or a link to a code repository. |
| Open Datasets | Yes | We use the Java dataset collected by Allamanis et al. (2016) to evaluate the function name prediction. The dataset includes 11 Java projects, such as Hadoop, Gradle, etc., totaling 707K methods and 5.6M statements. We fix Hadoop as our test set and use the other projects for training, to ensure the two sets do not overlap. For defect prediction, we obtain the dataset from Defects4J (Just et al., 2014). We collect and compile 27 open-source projects, such as OpenSSL, ImageMagick, Coreutils, SQLite, etc. |
| Dataset Splits | No | The paper mentions '14K/6K training/testing samples', which indicates a train/test split. However, it does not explicitly provide details for a separate validation split or a complete three-way (train/validation/test) split. |
| Hardware Specification | Yes | We conduct all the experiments on three Linux servers with Ubuntu 20.04 LTS, each featuring an AMD EPYC 7502 processor, 128 virtual cores, and 256GB RAM, with 12 Nvidia RTX 3090 GPUs in total. |
| Software Dependencies | No | The paper states, 'We implement SYMC using Fairseq (Ott et al., 2019) and PyTorch (Paszke et al., 2019)' and mentions using 'Ghidra'. While specific software is named with citations, explicit version numbers for PyTorch, Fairseq, or Ghidra are not provided in the text. |
| Experiment Setup | Yes | We use SYMC with 8 attention layers, 12 attention heads, and a maximum input length of 512. For training, we use 10 epochs, a batch size of 64, and 14K/6K training/testing samples (strictly non-overlapping) unless stated otherwise. We employ 16-bit weight parameters for SYMC to optimize for memory efficiency. |
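
To make the equivariance claim in the Research Type row concrete, here is a minimal, hypothetical sketch (ours, not the authors' implementation): single-head self-attention whose only positional signal is a bias matrix derived from the program dependence graph (PDG). Under that assumption, permuting the statements and conjugating the bias by the same permutation permutes the outputs identically, which is the equivariance property the paper builds on. The names `pdg_equivariant_attention` and `pdg_bias` are our own.

```python
# Minimal sketch (ours, not the authors' code) of self-attention that is
# equivariant to statement permutations, assuming the ONLY positional
# signal is a bias matrix derived from the PDG.
import torch
import torch.nn.functional as F

def pdg_equivariant_attention(x, pdg_bias):
    """x: (n, d) statement embeddings; pdg_bias: (n, n) PDG-derived bias."""
    d = x.size(-1)
    # Single head; learned per-token projections are omitted for brevity
    # (pointwise projections would preserve the equivariance argument).
    scores = x @ x.T / d ** 0.5 + pdg_bias
    return F.softmax(scores, dim=-1) @ x

# Equivariance check: permuting the statements and conjugating the
# PDG bias by the same permutation permutes the outputs identically.
n, d = 6, 8
x = torch.randn(n, d)
bias = torch.randn(n, n)            # stand-in for a real PDG-derived bias
perm = torch.randperm(n)
out = pdg_equivariant_attention(x, bias)
out_perm = pdg_equivariant_attention(x[perm], bias[perm][:, perm])
assert torch.allclose(out[perm], out_perm, atol=1e-5)
```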
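
For quick reference, the reported experiment setup can be collected into a plain configuration dict. The key names are our own shorthand; only the values come from the Experiment Setup row above.

```python
# Hedged summary of the reported setup; key names are ours.
symc_setup = {
    "attention_layers": 8,
    "attention_heads": 12,
    "max_input_length": 512,
    "epochs": 10,
    "batch_size": 64,
    "train_samples": 14_000,     # "14K/6K training/testing samples"
    "test_samples": 6_000,
    "weight_precision": "fp16",  # 16-bit weights for memory efficiency
}
```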