Self-Infilling Code Generation
Authors: Lin Zheng, Jianbo Yuan, Zhi Zhang, Hongxia Yang, Lingpeng Kong
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across a variety of code generation benchmarks demonstrate that decoding with self-infilling not only improves the output quality but also regularizes the overall generation, which effectively mitigates potential degeneration and scaffolds code to be more consistent with intended functionality. |
| Researcher Affiliation | Collaboration | ¹The University of Hong Kong, ²ByteDance Inc. |
| Pseudocode | Yes | Algorithm 1 (Looping Mechanism), Algorithm 2 (Self-infilling Interruption), Algorithm 3 (Left-to-right Generation), and Figure 5 (Python pseudo-code implementation of the parsing function for left-to-right generation). An illustrative sketch of the interruption-and-looping control flow appears below the table. |
| Open Source Code | No | The paper states that they utilize "open-sourced STARCODER (Li et al., 2023) and CODE LLAMA (Roziere et al., 2023) models", but it does not provide a link or explicit statement about releasing the source code for their own proposed methodology. |
| Open Datasets | Yes | Our evaluation encompasses a range of code generation benchmarks, including HUMANEVAL (Chen et al., 2021), MBPP (Austin et al., 2021), and DS-1000 (Lai et al., 2023). In addition, we also extend our analysis to multilingual code generation with MULTIPL-E (Cassano et al., 2023) and mathematical reasoning with GSM8K (Cobbe et al., 2021). |
| Dataset Splits | No | The paper specifies evaluation on existing benchmarks' test sets (e.g., MBPP includes 500 crowd-sourced basic Python programming problems as the test set), but it does not provide explicit training/validation/test dataset splits for these benchmarks in the context of their own experimental setup for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using specific models like STARCODER and CODE LLAMA, but it does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | We set the maximum context length to 2048 tokens and limit the maximum number of generated tokens to 512, except for the HUMANEVAL benchmark, where we limit the context length to 640 for accelerating decoding. For self-infilling generation, τ and N are defaulted to 0.25 and 2, respectively, unless otherwise specified. Pass@1 rates are calculated via greedy decoding, while pass@10 and pass@100 are computed by generating 200 samples at temperature 0.8 using nucleus sampling (Holtzman et al., 2020) with top-p = 0.95. |
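The algorithms listed in the Pseudocode row are not reproduced here, but the control flow they describe can be sketched compactly. The snippet below is an illustrative reconstruction, not the authors' code: `decode_left_to_right`, `draft_suffix`, and `infill_middle` are hypothetical stand-ins for a fill-in-the-middle-capable model such as StarCoder or Code Llama, and only the defaults τ = 0.25 and N = 2 are taken from the paper.

```python
from typing import Callable, Tuple

def self_infilling_generate(
    prompt: str,
    decode_left_to_right: Callable[[str, float], Tuple[str, bool]],  # returns (continuation, interrupted?)
    draft_suffix: Callable[[str], str],                               # drafts a closing suffix from the prefix
    infill_middle: Callable[[str, str], str],                         # fills code between prefix and suffix
    tau: float = 0.25,   # interruption threshold on next-token confidence (paper default)
    n_loops: int = 2,    # number of looping passes (paper default)
) -> str:
    """Hypothetical sketch of interruption plus looping in self-infilling decoding."""
    prefix, suffix = prompt, ""
    for _ in range(n_loops):
        # Ordinary left-to-right decoding; the callable is assumed to stop early and
        # report interrupted=True once next-token confidence drops below tau.
        continuation, interrupted = decode_left_to_right(prefix, tau)
        prefix += continuation
        if not interrupted:
            return prefix + suffix  # decoding finished normally
        # Self-infilling step: draft a suffix first (e.g., the end of the function),
        # then fill in the middle conditioned on both prefix and suffix.
        suffix = draft_suffix(prefix)
        prefix += infill_middle(prefix, suffix)
    return prefix + suffix
```

Each looping pass folds the infilled middle back into the prefix, so the next pass can revise the drafted suffix with more context before decoding resumes.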
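The Experiment Setup row states that pass@10 and pass@100 are computed from 200 samples per problem at temperature 0.8. Such scores are conventionally obtained with the unbiased pass@k estimator of Chen et al. (2021); a minimal sketch assuming NumPy follows, where the function name `pass_at_k` is illustrative rather than code from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k), evaluated as a
    numerically stable product. n = samples drawn per problem (200 here),
    c = samples passing the unit tests, k = evaluation budget (10 or 100)."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

Per-problem estimates are then averaged over the benchmark; pass@1 under greedy decoding reduces to the fraction of problems whose single output passes the tests.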