Retrieval-Augmented Generation for Code Summarization via Hybrid GNN
Authors: Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, Yang Liu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the proposed approach, we release a new challenging benchmark, crawled from diversified large-scale open-source C projects (total 95k+ unique functions in the dataset). Our method achieves the state-of-the-art performance, improving existing methods by 1.42, 2.44 and 1.29 in terms of BLEU-4, ROUGE-L and METEOR. |
| Researcher Affiliation | Academia | Shangqing Liu¹, Yu Chen², Xiaofei Xie¹, Jingkai Siow¹, Yang Liu¹; ¹Nanyang Technological University, ²Rensselaer Polytechnic Institute |
| Pseudocode | No | The paper does not contain a dedicated pseudocode block or algorithm section. |
| Open Source Code | Yes | We also release a new code summarization benchmark by crawling data from popular and diversified projects containing 95k+ functions in C programming language and make it public: https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization |
| Open Datasets | Yes | We are the first to explore neural summarization on C programming language, and make our C Code Summarization Dataset (CCSD) public to benefit academia and industry. |
| Dataset Splits | Yes | Finally, we obtain 84,316 training functions, 4,432 in-domain validation functions, 4,203 in-domain test functions and 2,330 out-of-domain test functions. |
| Hardware Specification | Yes | All experiments are conducted on a DGX server with four Nvidia Tesla V100 GPUs, and each epoch takes about 6 minutes on average. |
| Software Dependencies | No | The paper mentions models like BiLSTM, GRU, and LSTM, but does not provide specific version numbers for software libraries or dependencies used (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We embed the most frequent 40,000 words in the training set with 512 dimensions and set the hidden size of the BiLSTM to 256, so the concatenated state size for both directions is 512. Dropout of 0.3 is applied after the word embedding layer and the BiLSTM. We set GNN hops to 1 for the best performance. The optimizer is Adam with an initial learning rate of 0.001. The batch size is set to 64 and the early stopping patience to 10. The beam search width is set to 5 as usual. (A configuration sketch follows this table.) |
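For readers checking these settings, below is a minimal PyTorch sketch that wires up the reported hyperparameters (40,000-word vocabulary, 512-dim embeddings, BiLSTM hidden size 256 per direction, dropout 0.3, Adam at learning rate 0.001, batch size 64). The module names and wiring are assumptions for illustration only: the paper's hybrid GNN and retrieval components are not reproduced here, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the paper's experiment setup.
# Everything else (class names, wiring) is an assumption for illustration.
VOCAB_SIZE = 40_000      # most frequent words in the training set
EMBED_DIM = 512
HIDDEN_DIM = 256         # per direction; concatenated BiLSTM state is 512
DROPOUT = 0.3
LEARNING_RATE = 1e-3
BATCH_SIZE = 64
BEAM_WIDTH = 5           # reported decoding setting; unused in this encoder-only sketch
GNN_HOPS = 1             # reported GNN setting; the GNN itself is not sketched here

class SequenceEncoder(nn.Module):
    """Token-sequence encoder mirroring the embedding + BiLSTM settings above."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.dropout = nn.Dropout(DROPOUT)          # applied after embedding and BiLSTM
        self.bilstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        x = self.dropout(self.embed(token_ids))
        out, _ = self.bilstm(x)                     # (batch, seq_len, 2 * HIDDEN_DIM)
        return self.dropout(out)

encoder = SequenceEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=LEARNING_RATE)

# Example: encode a batch of 64 dummy token sequences of length 100.
dummy_ids = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, 100))
states = encoder(dummy_ids)                         # shape: (64, 100, 512)
```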