Self-Infilling Code Generation

Authors: Lin Zheng, Jianbo Yuan, Zhi Zhang, Hongxia Yang, Lingpeng Kong

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments across a variety of code generation benchmarks demonstrate that decoding with self-infilling not only improves the output quality but also regularizes the overall generation, which effectively mitigates potential degeneration and scaffolds code to be more consistent with intended functionality."
Researcher Affiliation | Collaboration | ¹The University of Hong Kong, ²ByteDance Inc.
Pseudocode | Yes | Algorithm 1 (Looping Mechanism), Algorithm 2 (Self-infilling Interruption), Algorithm 3 (Left-to-right Generation), and Figure 5 (Python pseudo-code implementation of the parsing function for left-to-right generation); see the decoding sketch after this table.
Open Source Code | No | The paper states that it utilizes the "open-sourced STARCODER (Li et al., 2023) and CODE LLAMA (Roziere et al., 2023)" models, but it provides no link to, or explicit statement about releasing, source code for the proposed method.
Open Datasets | Yes | "Our evaluation encompasses a range of code generation benchmarks, including HUMANEVAL (Chen et al., 2021), MBPP (Austin et al., 2021), and DS-1000 (Lai et al., 2023). In addition, we also extend our analysis to multilingual code generation with MULTIPL-E (Cassano et al., 2023) and mathematical reasoning with GSM8K (Cobbe et al., 2021)."
Dataset Splits | No | The paper evaluates on the existing benchmarks' test sets (e.g., MBPP's 500 crowd-sourced basic Python programming problems), but it does not specify explicit training/validation/test splits for its own experimental setup.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using specific models like STARCODER and CODE LLAMA, but it does not provide version numbers for any ancillary software dependencies (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | "We set the maximum context length to 2048 tokens and limit the maximum number of generated tokens to 512, except for the HUMANEVAL benchmark, where we limit the context length to 640 for accelerating decoding. For self-infilling generation, τ and N are defaulted to 0.25 and 2, respectively, unless otherwise specified. Pass@1 rates are calculated via greedy decoding, while pass@10 and pass@100 are computed by generating 200 samples at temperature 0.8 using nucleus sampling (Holtzman et al., 2020) with top-p = 0.95." See the configuration sketch after this table.
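
The algorithm names listed in the Pseudocode row give a sense of the decoding procedure, but the paper's pseudocode itself is not reproduced here. The sketch below is a minimal, heavily simplified illustration of how an interruption-style self-infilling pass might be wired around an infilling-capable model, assuming StarCoder-style FIM sentinel tokens (<fim_prefix>, <fim_suffix>, <fim_middle>), an assumed checkpoint name, and a simple confidence check against τ. It compresses the ideas behind Algorithms 1-3 into a single pass, omits the looping mechanism and the model-generated suffix, and is not the authors' implementation.

```python
# Simplified self-infilling-style decoding sketch (illustration only).
# Assumptions: StarCoder-style FIM tokens, bigcode/starcoderbase-1b checkpoint,
# and a single interruption pass; the paper's Algorithms 1-3 are richer than this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigcode/starcoderbase-1b"  # assumed FIM-capable checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def self_infill(prompt: str, tau: float = 0.25, max_new: int = 256) -> str:
    """Left-to-right decoding that interrupts itself when confidence drops below tau."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new):
        with torch.no_grad():
            probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
        conf, nxt = probs.max(dim=-1)
        if conf.item() < tau:            # interruption trigger: low next-token confidence
            break
        ids = torch.cat([ids, nxt.view(1, 1)], dim=-1)
        if nxt.item() == tok.eos_token_id:
            return tok.decode(ids[0], skip_special_tokens=True)
    # Switch to infilling: tokens decoded so far become the FIM prefix, and the model
    # fills in the remaining middle span (the actual method also lets the model
    # generate a suffix first; an empty suffix is used here for brevity).
    prefix = tok.decode(ids[0], skip_special_tokens=True)
    fim_ids = tok(f"<fim_prefix>{prefix}<fim_suffix><fim_middle>",
                  return_tensors="pt").input_ids
    out = model.generate(fim_ids, max_new_tokens=max_new, do_sample=False)
    return prefix + tok.decode(out[0, fim_ids.shape[1]:], skip_special_tokens=True)

print(self_infill("def fibonacci(n):\n    "))
```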
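
The quoted experiment setup maps directly onto standard decoding parameters. The configuration sketch below shows how those settings (greedy decoding for pass@1; 200 samples at temperature 0.8 with top-p = 0.95 for pass@10 and pass@100; at most 512 new tokens) could be expressed with the HuggingFace transformers generate API, together with the standard unbiased pass@k estimator from Chen et al. (2021). The checkpoint and inference stack are assumptions, since the paper does not specify them.

```python
# Hedged sketch of the reported decoding configuration (illustration only).
# Assumptions: HuggingFace transformers inference and the starcoderbase-1b checkpoint.
from math import comb

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")
prompt_ids = tok("def is_prime(n):\n    ", return_tensors="pt").input_ids

# pass@1: greedy decoding with at most 512 new tokens.
greedy = model.generate(prompt_ids, max_new_tokens=512, do_sample=False)

# pass@10 / pass@100: 200 samples per problem with nucleus sampling,
# temperature 0.8 and top-p 0.95, as stated in the experiment setup.
samples = model.generate(
    prompt_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=200,
)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=37, k=10))
```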