CERT: Continual Pre-training on Sketches for Library-oriented Code Generation

Authors: Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, Jian-Guang Lou

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the impressive performance of CERT. For example, it surpasses the base model by an absolute 15.67% improvement in terms of pass@1 on PandasEval. Our work is available at https://github.com/microsoft/PyCodeGPT. We perform extensive experiments on CERT. Results indicate that CERT has superior performance on library-oriented code generation.
Researcher Affiliation | Collaboration | Daoguang Zan (1,2), Bei Chen (3), Dejian Yang (3), Zeqi Lin (3), Minsu Kim (4), Bei Guan (2,5), Yongji Wang (2,5,6), Weizhu Chen (7), Jian-Guang Lou (3). Affiliations: (1) Cooperative Innovation Center, Institute of Software, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) Microsoft Research Asia; (4) Korea University; (5) Integrative Innovation Center, Institute of Software, Chinese Academy of Sciences; (6) State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; (7) Microsoft Azure AI
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks that are clearly labeled as such.
Open Source Code | Yes | Our work is available at https://github.com/microsoft/PyCodeGPT.
Open Datasets | Yes | We summarize the three key points that make PyCodeGPT powerful: 1) a large amount of carefully cleaned data for pre-training; 2) a newly trained tokenizer, which is specialized in Python; and 3) a resampling strategy that prioritizes high-quality data. Besides PyCodeGPT, we also regard CodeGen (Mono 350M) [Nijkamp et al., 2022] as one of our base models, which is by far the best-performing publicly available model on HumanEval.
Dataset Splits | No | The paper mentions creating test cases for evaluation and using pass@k metrics (an unbiased pass@k estimator is sketched after this table), but it does not provide explicit details about training/validation/test splits, such as percentages, counts, or specific predefined split methods for the datasets used.
Hardware Specification | Yes | PyCodeGPT is pre-trained for 200K steps and 100B tokens on a cluster of 16 NVIDIA V100 GPUs with 32GB memory. We pre-train the model for 100K steps on a cluster of 8 NVIDIA V100 GPUs with 32GB memory.
Software Dependencies | No | We implement our approach using PyTorch [Paszke et al., 2019], Hugging Face's transformers library [Wolf et al., 2019], and DeepSpeed. The paper names the software libraries used but does not provide specific version numbers for them.
Experiment Setup | Yes | In the training phase of PyCodeGPT, we set the batch size to 10, the window size to 1024, the learning rate to 5e-4, the gradient accumulation steps to 4, and the weight decay to 0.1. The settings of the sketcher and generator are the same as those of PyCodeGPT. (A hedged configuration sketch follows this table.)
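
Because the rows above cite pass@1 and pass@k results (e.g., the reported 15.67% absolute pass@1 gain on PandasEval), the following is a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021), which evaluations of this kind typically use. The function name pass_at_k, the NumPy dependency, and the example numbers are illustrative assumptions, not artifacts taken from the CERT paper or repository.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: samples among the n that pass all test cases
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers only (not figures from the paper):
print(pass_at_k(n=200, c=31, k=1))   # ~0.155, i.e. pass@1 of 15.5%
print(pass_at_k(n=200, c=31, k=10))  # higher, since any of 10 tries may pass
```

For a benchmark such as PandasEval or HumanEval, this per-problem estimate is averaged over all problems to obtain the reported pass@k score.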
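
The Experiment Setup row gives concrete hyperparameters, and the Software Dependencies row says the authors used PyTorch, Hugging Face's transformers, and DeepSpeed, but no launch scripts or versions are listed here. The block below is therefore only a hedged sketch of how the reported values (batch size 10, window size 1024, learning rate 5e-4, gradient accumulation 4, weight decay 0.1) could be mapped onto transformers' TrainingArguments; the output directory, step count, and mixed-precision choice are assumptions rather than settings confirmed by the paper.

```python
from transformers import TrainingArguments

# Hedged sketch: reported hyperparameters mapped onto TrainingArguments.
# Lines marked "assumption" are illustrative, not taken from the paper.
training_args = TrainingArguments(
    output_dir="cert-continual-pretraining",  # assumption: placeholder path
    per_device_train_batch_size=10,           # batch size 10 (reported)
    gradient_accumulation_steps=4,            # reported
    learning_rate=5e-4,                       # reported
    weight_decay=0.1,                         # reported
    max_steps=100_000,                        # 100K steps, per the Hardware Specification row
    fp16=True,                                # assumption: mixed precision on 32GB V100s
)

# The 1024-token window size applies when the corpus is tokenized, e.g.
# tokenizer(text, truncation=True, max_length=1024).
```

DeepSpeed can be enabled through the Trainer's deepspeed argument, but the excerpt above gives no DeepSpeed configuration, so that part is left out of the sketch.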