Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

More Than Just Functional: LLM-as-a-Critique for Efficient Code Generation

Authors: Derui Zhu, Dingfan Chen, Jinfu Chen, Jens Grossklags, Alexander Pretschner, Weiyi Shang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on benchmark datasets (EffiBench, HumanEval+, COFFE, Mercury) across multiple representative code models demonstrate up to a 70.6% reduction in average execution time and a 13.6% decrease in maximum memory usage, highlighting the computational efficiency and practicality of our approach compared to existing alternatives.
Researcher Affiliation	Academia	(1) Technical University of Munich; (2) Max Planck Institute for Intelligent Systems; (3) Wuhan University; (4) University of Waterloo
Pseudocode	No	The paper describes the methodology in prose in Sections 3.1 and 3.2. Section B, 'Static AST Patterns', provides code snippets demonstrating AST patterns, but these are examples and not the pseudocode for the main algorithm. There are no clearly labeled 'Pseudocode' or 'Algorithm' blocks for the overall method.
Open Source Code	Yes	We provide the implementation at the following link: https://github.com/hitum-dev/Fastdecoder.
Open Datasets	Yes	We conduct experiments on four recent standard code benchmark datasets: EffiBench [23], a benchmark comprising 1,000 efficiency-critical LeetCode coding problems paired with human-written canonical solutions, filtered to 988 samples with verified correct test cases; HumanEval+ [31], an extension of HumanEval [7] with 164 human-written Python programming tasks with expanded test coverage for rigorous functional correctness evaluation; Mercury [14], a dataset of 1,889 Python tasks with test case generators and difficulty annotations derived from solution runtimes; and COFFE [39], a code generation benchmark with 398 and 358 problems for function-level and file-level code generation, respectively.
Dataset Splits	No	We conduct experiments on four recent standard code benchmark datasets: EffiBench [23], HumanEval+ [31], Mercury [14], and COFFE [39]. Correctness is evaluated as the proportion of test samples successfully passing all test cases. The paper describes evaluating pre-trained LLMs on test portions of these benchmarks, but does not specify any explicit training/validation splits used in the authors' experiments for these models, nor specific percentages or counts for splitting these datasets into training, validation, and test sets.
Hardware Specification	Yes	The code generation experiments were conducted on a SLURM-managed computing cluster equipped with 16 NVIDIA A100 Tensor Core GPUs (80GB memory each), interconnected via NVLink 3.0 technology. Each compute node featured 512GB memory. For code performance evaluation, all measurements were conducted in isolated environments. Each test was run on a separate virtual machine instance with identical configurations to minimize system-level variability: a dedicated CPU-only node was deployed containing dual Intel Xeon E5-2695 v4 processors (36 threads total @ 2.1GHz base frequency) with 512GB DDR4-2400 memory.
Software Dependencies	No	To measure performance, we profile execution time and memory usage using Line Profiler and Memory Profiler. While specific tools are mentioned, version numbers for these or other software dependencies are not explicitly provided.
Experiment Setup	Yes	For the search strategies, we set a default search space of 50 (n=50 for best-of-n) and p=0.95 for nucleus (top-p) sampling following common practice [36]. We set a maximum limit of new tokens to 256 to enforce a stopping criterion for token generation. The temperature is set to 1 by default. Our method is implemented using beam search and best-of-n selection, with a default beam width b = 1 and number of trials n = 50. Table 5 summarizes the default hyperparameter settings for the different configurations of our method used in the ablation study. The default configuration, corresponding to the main paper results, uses the composite scoring function α · r_AST + β · r_LLM + γ · PP with b = 1 and n = 50.
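
Illustrative sketch (not the authors' implementation): the Experiment Setup row describes best-of-n selection under a composite score α · r_AST + β · r_LLM + γ · PP. The Python below shows one way such a selection step could look. Only the default values in DEFAULTS come from the text above. The names Weights, composite_score, best_of_n, and toy_ast_reward are hypothetical, the three scorers are placeholders passed in by the caller, and the toy AST reward (a penalty for nested loops) is an assumption made purely for illustration. The meaning and sign convention of each term are not specified in this entry, so the weights are left to the caller.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Sequence

# Default settings quoted in the Experiment Setup row above.
DEFAULTS = {
    "n": 50,                # best-of-n trials
    "beam_width": 1,        # b
    "top_p": 0.95,          # nucleus sampling
    "temperature": 1.0,
    "max_new_tokens": 256,
}

# A scorer maps a candidate program (source text) to a real-valued score.
Scorer = Callable[[str], float]


@dataclass(frozen=True)
class Weights:
    """Hypothetical container for the mixing weights alpha, beta, gamma."""
    alpha: float
    beta: float
    gamma: float


def composite_score(code: str, w: Weights,
                    r_ast: Scorer, r_llm: Scorer, pp: Scorer) -> float:
    """Weighted sum alpha * r_AST + beta * r_LLM + gamma * PP."""
    return w.alpha * r_ast(code) + w.beta * r_llm(code) + w.gamma * pp(code)


def best_of_n(candidates: Sequence[str], w: Weights,
              r_ast: Scorer, r_llm: Scorer, pp: Scorer) -> str:
    """Return the candidate with the highest composite score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates,
               key=lambda c: composite_score(c, w, r_ast, r_llm, pp))


def toy_ast_reward(code: str) -> float:
    """Assumed stand-in for r_AST: 1.0 minus 0.5 per loop nested in a loop."""
    loops = (ast.For, ast.While)
    nested = 0
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, loops):
            # Count loops that appear anywhere below this loop.
            nested += sum(1 for child in ast.walk(node)
                          if child is not node and isinstance(child, loops))
    return 1.0 - 0.5 * nested
```

In practice the n candidates would be sampled from a code model with the settings in DEFAULTS, and r_llm and pp would be supplied by whatever critique model and third scoring term the method uses; here they are left as caller-provided functions so that nothing about the real components is assumed.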