Self-Infilling Code Generation
Authors: Lin Zheng, Jianbo Yuan, Zhi Zhang, Hongxia Yang, Lingpeng Kong
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across a variety of code generation benchmarks demonstrate that decoding with self-infilling not only improves the output quality but also regularizes the overall generation, which effectively mitigates potential degeneration and scaffolds code to be more consistent with intended functionality. |
| Researcher Affiliation | Collaboration | ¹The University of Hong Kong, ²ByteDance Inc. |
| Pseudocode | Yes | Algorithm 1 (Looping Mechanism), Algorithm 2 (Self-infilling Interruption), Algorithm 3 (Left-to-right Generation), and Figure 5 (Python pseudo-code implementation of the parsing function for left-to-right generation). An illustrative sketch of the interruption-and-looping control flow appears below the table. |
| Open Source Code | No | The paper states that they utilize "open-sourced STARCODER (Li et al., 2023) and CODE LLAMA (Roziere et al., 2023) models", but it does not provide a link or explicit statement about releasing the source code for their own proposed methodology. |
| Open Datasets | Yes | Our evaluation encompasses a range of code generation benchmarks, including HUMANEVAL (Chen et al., 2021), MBPP (Austin et al., 2021), and DS-1000 (Lai et al., 2023). In addition, we also extend our analysis to multilingual code generation with MULTIPL-E (Cassano et al., 2023) and mathematical reasoning with GSM8K (Cobbe et al., 2021). |
| Dataset Splits | No | The paper specifies evaluation on existing benchmarks' test sets (e.g., MBPP includes 500 crowd-sourced basic Python programming problems as the test set), but it does not provide explicit training/validation/test dataset splits for these benchmarks in the context of their own experimental setup for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using specific models like STARCODER and CODE LLAMA, but it does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | We set the maximum context length to 2048 tokens and limit the maximum number of generated tokens to 512, except for the HUMANEVAL benchmark, where we limit the context length to 640 for accelerating decoding. For self-infilling generation, τ and N are defaulted to 0.25 and 2, respectively, unless otherwise specified. Pass@1 rates are calculated via greedy decoding, while pass@10 and pass@100 are computed by generating 200 samples at temperature 0.8 using nucleus sampling (Holtzman et al., 2020) with top-p = 0.95. |
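The algorithms listed in the Pseudocode row are not reproduced here, but the control flow they describe can be sketched compactly. The snippet below is an illustrative reconstruction, not the authors' code: `decode_left_to_right`, `draft_suffix`, and `infill_middle` are hypothetical stand-ins for a fill-in-the-middle-capable model such as StarCoder or Code Llama, and only the defaults τ = 0.25 and N = 2 are taken from the paper.

```python
from typing import Callable, Tuple

def self_infilling_generate(
    prompt: str,
    decode_left_to_right: Callable[[str, float], Tuple[str, bool]],  # returns (continuation, interrupted?)
    draft_suffix: Callable[[str], str],                               # drafts a closing suffix from the prefix
    infill_middle: Callable[[str, str], str],                         # fills code between prefix and suffix
    tau: float = 0.25,   # interruption threshold on next-token confidence (paper default)
    n_loops: int = 2,    # number of looping passes (paper default)
) -> str:
    """Hypothetical sketch of interruption plus looping in self-infilling decoding."""
    prefix, suffix = prompt, ""
    for _ in range(n_loops):
        # Ordinary left-to-right decoding; the callable is assumed to stop early and
        # report interrupted=True once next-token confidence drops below tau.
        continuation, interrupted = decode_left_to_right(prefix, tau)
        prefix += continuation
        if not interrupted:
            return prefix + suffix  # decoding finished normally
        # Self-infilling step: draft a suffix first (e.g., the end of the function),
        # then fill in the middle conditioned on both prefix and suffix.
        suffix = draft_suffix(prefix)
        prefix += infill_middle(prefix, suffix)
    return prefix + suffix
```

Each looping pass folds the infilled middle back into the prefix, so the next pass can revise the drafted suffix with more context before decoding resumes.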
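The Experiment Setup row states that pass@10 and pass@100 are computed from 200 samples per problem at temperature 0.8. Such scores are conventionally obtained with the unbiased pass@k estimator of Chen et al. (2021); a minimal sketch assuming NumPy follows, where the function name `pass_at_k` is illustrative rather than code from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k), evaluated as a
    numerically stable product. n = samples drawn per problem (200 here),
    c = samples passing the unit tests, k = evaluation budget (10 or 100)."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

Per-problem estimates are then averaged over the benchmark; pass@1 under greedy decoding reduces to the fraction of problems whose single output passes the tests.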