InCoder: A Generative Model for Code Infilling and Synthesis

Authors: Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, Mike Lewis

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate performance on a range of zero-shot code infilling tasks (Sec. 4), both new and from existing work, including challenging use cases such as type prediction, variable re-naming, comment generation, and completing missing lines of code. Zero-shot infilling with bidirectional context substantially outperforms approaches based on left-to-right-only models, and on several tasks obtains performance comparable to state-of-the-art models fine-tuned on the tasks. Ablation experiments (Sec. 5) show that this does not come at the cost of left-to-right generation ability; our causal masking model achieves similar performance to a standard language model on program synthesis benchmarks (Chen et al., 2021a; Austin et al., 2021) despite its more general training objective.
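The causal-masking objective referenced above trains a left-to-right model to infill by moving a masked span to the end of the sequence behind a sentinel token. A minimal sketch of that document transformation, assuming illustrative sentinel names ("<mask:0>", "<eom>") rather than the model's exact vocabulary:

```python
# Sketch of the causal-masking format used to train infilling with a
# left-to-right model. Sentinel token names are illustrative only.

def to_causal_masked(doc: str, start: int, end: int) -> str:
    """Replace doc[start:end] with a sentinel and append the span at
    the end, so the model conditions on both left and right context
    before generating the masked span."""
    span = doc[start:end]
    left, right = doc[:start], doc[end:]
    return f"{left}<mask:0>{right}<mask:0>{span}<eom>"


# At inference time, the same format lets the model infill: supply
# left + sentinel + right + sentinel, then sample until "<eom>".
example = to_causal_masked("def add(a, b):\n    return a + b\n", 19, 31)
```

At inference time the prefix up to the second sentinel is given as the prompt, and everything the model generates before the end-of-mask token is spliced back into the original location.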
Researcher Affiliation | Collaboration | Facebook AI Research; University of Washington; UC Berkeley; TTI-Chicago; Carnegie Mellon University
Pseudocode | No | The paper mentions "pseudocode" in Sections 6 and 7 as a related concept or prior work, but does not include any pseudocode or algorithm blocks within its own content.
Open Source Code | Yes | Our models and code are publicly released: https://sites.google.com/view/incoder-code-models/
Open Datasets | Yes | We collect a corpus of (1) public code with permissive, non-copyleft, open-source licenses from GitHub and GitLab and (2) Stack Overflow questions, answers, and comments. ... We decontaminate our pre-training corpus by removing all datasets which we use in our evaluation experiments. See Section A.1 for details. Our final pre-training corpus contains a total of 159 GB of code, 52 GB of it in Python, and a total of 57 GB of content from Stack Overflow. ... We create an infilling benchmark for complete lines of code from the HumanEval dataset (Chen et al., 2021a). ... We use the CodeXGLUE code-to-text docstring generation task (Lu et al., 2021), which is constructed from CodeSearchNet (Husain et al., 2019).
Dataset Splits | Yes | We compare model pass@1 scores on the HumanEval (Chen et al., 2021a) and MBPP (Austin et al., 2021) left-to-right synthesis benchmarks... For HumanEval, we follow past work by prompting with function signatures and docstring descriptions, sample 200 candidate program completions, and compute pass@1, pass@10, and pass@100 using the unbiased sampling estimator of Chen et al. (2021a).
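The unbiased pass@k estimator of Chen et al. (2021a) cited here computes, from n sampled completions of which c pass the tests, the probability that at least one of k draws (without replacement) would pass. A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021a):
    pass@k = 1 - C(n - c, k) / C(n, k),
    where n = total samples, c = samples passing the tests.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Example: 200 samples per problem (as in the quoted setup), 50 correct.
score = pass_at_k(200, 50, 10)
```

Computing the estimator via binomial coefficients (rather than averaging over random k-subsets) avoids the bias and variance of naive subsampling.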
Hardware Specification | Yes | INCODER-6.7B was trained on 248 V100 GPUs for 24 days.
Software Dependencies | No | The paper mentions software like Fairseq and PyTorch, but does not provide specific version numbers for these or other key software components used in the experiments. For example, "Our implementation utilized the causal masking implementation (Aghajanyan et al., 2022a) available in Fairseq (Ott et al., 2019), with the underlying library being PyTorch (Paszke et al., 2019)." does not specify the version of Fairseq or PyTorch used.
Experiment Setup | Yes | For all three inference methods, we obtain generations from the model using top-p (nucleus) sampling (Holtzman et al., 2020) with p = 0.95 and a temperature tuned for each task and inference method using the task's development data. ... Our per-GPU batch size was 8, with a maximum token sequence length of 2048. We clip all gradient norms to 1.0 and use the Adam optimizer with β1 = 0.9, β2 = 0.98 (Kingma & Ba, 2015). For our learning rate scheduler, we use the built-in polynomial decay learning rate scheduler available in PyTorch (Paszke et al., 2019) with 1500 warmup updates. ... For the L-R single and CM infilling methods, we sample using a temperature of 0.2. For the L-R rerank method, we use a temperature of 0.8 to sample K = 10 candidates and rescore with the total log probability of the completed function.
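The top-p (nucleus) sampling procedure quoted above keeps the smallest set of tokens whose cumulative probability exceeds p, renormalizes, and samples from that set. A minimal NumPy sketch (not the paper's implementation), with defaults mirroring the p = 0.95 and temperature = 0.2 values quoted for the infilling methods:

```python
import numpy as np

def nucleus_sample(logits, p=0.95, temperature=0.2, rng=None):
    """Top-p (nucleus) sampling over a vector of logits.

    Keeps the smallest prefix of probability-sorted tokens whose
    cumulative mass reaches p, renormalizes, and samples one index.
    Defaults mirror the quoted setup; this is a sketch, not the
    authors' code."""
    rng = rng or np.random.default_rng()
    # Temperature-scaled softmax (max-subtracted for stability).
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Sort tokens by descending probability and find the nucleus.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # smallest prefix with mass >= p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))
```

Low temperatures (0.2) sharpen the distribution so the nucleus is small and sampling is near-greedy; the higher temperature (0.8) quoted for reranking trades per-sample quality for diversity across the K = 10 candidates.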