GraphCodeBERT: Pre-training Code Representations with Data Flow
Authors: Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, Ming Zhou
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. |
| Researcher Affiliation | Collaboration | Daya Guo1, Shuo Ren2, Shuai Lu3, Zhangyin Feng4, Duyu Tang5, Shujie Liu5, Long Zhou5, Nan Duan5, Alexey Svyatkovskiy6, Shengyu Fu6, Michele Tufano6, Shao Kun Deng6, Colin Clement6, Dawn Drain6, Neel Sundaresan6, Jian Yin1, Daxin Jiang7, and Ming Zhou5. 1School of Computer Science and Engineering, Sun Yat-sen University; 2Beihang University; 3Peking University; 4Harbin Institute of Technology; 5Microsoft Research Asia; 6Microsoft Devdiv; 7Microsoft STCA |
| Pseudocode | No | The paper does not contain any figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | All the codes and data are available at https://github.com/microsoft/CodeBERT. |
| Open Datasets | Yes | We pre-train GraphCodeBERT on the CodeSearchNet dataset (Husain et al., 2019), which includes 2.3M functions of six programming languages paired with natural language documents. |
| Dataset Splits | Yes | Training examples, Dev queries, Testing queries, and Candidate codes are listed per language for the Code Search task (from Table 7), and We use the Adam optimizer to update model parameters and perform early stopping on the development set. |
| Hardware Specification | Yes | We train the model on two DGX-2 machines, each having 16 NVIDIA Tesla V100 with 32GB memory. |
| Software Dependencies | No | The paper mentions software components like Transformer and Adam optimizer, but it does not specify version numbers for any programming languages, libraries, or frameworks used in the implementation (e.g., Python version, PyTorch/TensorFlow versions). See the model-loading sketch below the table for one assumed setup. |
| Experiment Setup | Yes | We set the max length of sequences and nodes as 512 and 128, respectively. We use the Adam optimizer to update model parameters with 1,024 batch size and 2e-4 learning rate. (from Appendix A) and In the fine-tuning step, we set the learning rate as 2e-5, the batch size as 32, the max sequence length of queries and codes as 128 and 256, and the max number of nodes as 64. (from Appendix B). See the hyperparameter sketch below the table. |
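Because the paper does not pin software versions, the snippet below is a minimal sketch of one possible environment, assuming the Hugging Face `transformers` library and the publicly released `microsoft/graphcodebert-base` checkpoint. None of these names are prescribed by the paper; they are our assumption for illustration only.

```python
# Minimal sketch (assumption): load the released GraphCodeBERT encoder via the
# Hugging Face `transformers` library. The paper does not specify library versions.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

# Encode a toy code snippet; the data-flow node inputs used in the paper are omitted here.
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```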
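The pre-training and code-search fine-tuning values quoted in the Experiment Setup row can be collected into a small configuration sketch. This is our own illustration assuming a PyTorch-style training loop; the placeholder module stands in for the GraphCodeBERT encoder and is hypothetical, only the numeric values come from the paper's Appendices A and B.

```python
# Hyperparameters quoted from Appendices A and B, arranged as plain dictionaries.
# The encoder below is a hypothetical placeholder, not the authors' model.
import torch
from torch import nn

PRETRAIN_CONFIG = {
    "max_sequence_length": 512,  # max length of the (comment + code) token sequence
    "max_dataflow_nodes": 128,   # max number of data-flow graph nodes
    "batch_size": 1024,
    "learning_rate": 2e-4,       # Adam
}

CODE_SEARCH_FINETUNE_CONFIG = {
    "learning_rate": 2e-5,
    "batch_size": 32,
    "max_query_length": 128,
    "max_code_length": 256,
    "max_dataflow_nodes": 64,
}

dummy_encoder = nn.Linear(768, 768)  # stand-in for the GraphCodeBERT encoder
optimizer = torch.optim.Adam(dummy_encoder.parameters(),
                             lr=PRETRAIN_CONFIG["learning_rate"])
```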