LongCoder: A Long-Range Pre-trained Language Model for Code Completion

Authors: Daya Guo, Canwen Xu, Nan Duan, Jian Yin, Julian McAuley

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on a newly constructed dataset that contains longer code context and the publicly available CodeXGLUE benchmark. Experimental results demonstrate that LongCoder achieves superior performance on code completion tasks compared to previous models while maintaining comparable efficiency in terms of computational resources during inference.
Researcher Affiliation | Collaboration | Sun Yat-sen University; University of California, San Diego; Microsoft Research Asia.
Pseudocode | No | The paper describes the model architecture and attention mechanisms using mathematical equations but does not present pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | All the codes and data are available at https://github.com/microsoft/CodeBERT.
Open Datasets | Yes | To evaluate the effectiveness of LongCoder and encourage future research on Long Code Completion, we construct a new dataset called LCC... Specifically, we construct our datasets from the github-code dataset, which contains a vast number of code files sourced from GitHub with an open-source license that permits research use. (https://huggingface.co/datasets/codeparrot/github-code)
Dataset Splits | Yes | For each programming language, we sample 100k examples for training, 10k for development, and 10k for testing. (A sampling sketch follows the table.)
Hardware Specification | Yes | The inference memory consumption and runtime per example are calculated using beam search with a beam size of 5 and a maximum generation length of 64 on a single V100 GPU. (A measurement sketch follows the table.)
Software Dependencies | No | The paper mentions the 'Adam optimizer' and 'tree-sitter' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | During fine-tuning, we use the Adam optimizer with a batch size of 16 and a learning rate of 2e-4. We fine-tune the model for 10 epochs and perform early stopping on the development set. (A training-loop sketch follows the table.)
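The Dataset Splits row quotes per-language sizes of 100k/10k/10k drawn from the codeparrot/github-code corpus named in the Open Datasets row. The sketch below shows one way such sampling could be scripted; the streaming mode, the `language` field name, the shuffle buffer size, and the order in which splits are carved off are assumptions for illustration, not details from the paper.

```python
# Minimal sketch, under assumptions, of per-language sampling:
# 100k train / 10k dev / 10k test examples per language.
from itertools import islice
from datasets import load_dataset

SPLIT_SIZES = {"train": 100_000, "dev": 10_000, "test": 10_000}

def sample_language_splits(language: str, seed: int = 42):
    # Stream the corpus to avoid downloading all of it, keep one language,
    # shuffle with a buffer, then carve off the three splits in order.
    # Field name "language" and the buffer size are assumptions.
    stream = load_dataset("codeparrot/github-code", split="train", streaming=True)
    stream = stream.filter(lambda ex: ex["language"] == language)
    stream = iter(stream.shuffle(seed=seed, buffer_size=10_000))
    return {name: list(islice(stream, size)) for name, size in SPLIT_SIZES.items()}
```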
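The Experiment Setup row reports Adam, a batch size of 16, a learning rate of 2e-4, 10 epochs, and early stopping on the development set. The following sketch wires those hyperparameters into a plain PyTorch loop; the model object, the data collation, the patience value, and the checkpoint path are placeholders rather than details taken from the paper.

```python
# Hedged sketch of the reported fine-tuning setup: Adam, batch size 16, lr 2e-4,
# up to 10 epochs, early stopping on dev loss. Model/datasets are assumed to exist.
import torch
from torch.utils.data import DataLoader

def finetune(model, train_ds, dev_ds, device="cuda", patience=2):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
    dev_loader = DataLoader(dev_ds, batch_size=16)
    best_dev, epochs_without_improvement = float("inf"), 0

    for epoch in range(10):
        model.train()
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # Assumes a Hugging Face-style causal LM that returns .loss
            # when labels are included in the batch.
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Evaluate on the development set once per epoch.
        model.eval()
        with torch.no_grad():
            dev_loss = sum(
                model(**{k: v.to(device) for k, v in b.items()}).loss.item()
                for b in dev_loader
            ) / len(dev_loader)

        # Early stopping: keep the best checkpoint, stop after `patience`
        # epochs without improvement (patience value is an assumption).
        if dev_loss < best_dev:
            best_dev, epochs_without_improvement = dev_loss, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
```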
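The Hardware Specification row states that inference memory and runtime per example were measured with beam search (beam size 5, maximum generation length 64) on a single V100 GPU. Below is a hedged sketch of such a measurement using a Hugging Face-style generate API; whether the length cap maps to max_new_tokens or to a total max_length, and how examples are batched, are assumptions.

```python
# Hedged sketch of a per-example inference measurement: beam search with 5 beams,
# up to 64 generated tokens, peak GPU memory and wall-clock time on one GPU.
import time
import torch

@torch.no_grad()
def measure_inference(model, tokenizer, context: str, device="cuda"):
    inputs = tokenizer(context, return_tensors="pt").to(device)
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    model.generate(**inputs, num_beams=5, max_new_tokens=64)
    torch.cuda.synchronize(device)
    runtime_s = time.perf_counter() - start

    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return runtime_s, peak_mem_gb
```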