Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Execution Guided Line-by-Line Code Generation

Authors: Boaz Lavon, Shahar Katz, Lior Wolf

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming and data science tasks.
Researcher Affiliation	Academia	Boaz Lavon, Shahar Katz, Lior Wolf; Blavatnik School of Computer Science and AI, Tel Aviv University; {boazlavon@mail, shaharkatz3@mail, wolf@cs}.tau.ac.il
Pseudocode	Yes	The full pseudo-code for both the execution-guided inference loop and the multi-agent controller is provided in Appendix C.
Open Source Code	Yes	Our code is available at: https://github.com/boazlavon/eg_cfg
Open Datasets	Yes	In this work, we focus on six code generation benchmarks: MBPP [3], HumanEval [1], DS-1000 [31] and CodeContests [4], along with the extended variants MBPP-ET and HumanEval-ET [32].
Dataset Splits	Yes	Evaluation Benchmark: Our evaluations use widely-adopted benchmarks: MBPP [3] (500 tasks) and HumanEval [1] (164 tasks), along with their extended test versions MBPP-ET and HumanEval-ET [32]. To assess performance on more challenging tasks, we also evaluate on the DS-1000 data science benchmark [31] (1000 tasks) and the CodeContests competitive programming benchmark [4] (using the ExecEval framework [42]). We report accuracy: the percentage of problems passing all test cases. To rigorously test generalization and prevent overfitting to public tests, evaluations on HumanEval, HumanEval-ET, MBPP-ET, CodeContests, and DS-1000 rely on hidden test cases inaccessible during inference.
Hardware Specification	Yes	We conduct our experiments using two LLMs across different parameter scales: DeepSeek-Coder-1.3B [41], which is small enough to run locally on our machines (NVIDIA GeForce RTX 2080 Ti and RTX 3090 GPUs), and a large open-source model, DeepSeek-V3-0324 [24], which we use through a cloud inference endpoint.
Software Dependencies	No	The paper mentions using specific LLMs (DeepSeek-Coder-1.3B, DeepSeek-V3-0324) and refers to Python for code. However, it does not provide specific version numbers for ancillary software dependencies like programming languages, libraries, or frameworks (e.g., Python version, PyTorch/TensorFlow versions).
Experiment Setup	Yes	Hyperparameter Settings: As explained in Section 3.4, our method launches multiple parallel agents for each task. Each agent is assigned a different hyperparameter configuration. The following hyperparameter sets were used in our experiments: s = 3, t ∈ {0.7, 0.75, 0.85, 0.95, 1.2, 1.5}, d ∈ {2, 3, 6, 8}, γ ∈ {0, 0.5, 1, 3}. Additionally, we evaluate both p0 prompt templates; see Section 3 and Appendix A.
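
The hyperparameter sets quoted in the Experiment Setup row can be read as a grid of per-agent configurations. The following is a minimal sketch of how such a grid might be enumerated, not the authors' implementation. It assumes the agents cover the full Cartesian product of the listed values, which the quoted text does not state. The symbols s, t, d and γ are used as they appear in the quote and are defined in the paper; the two prompt template labels are hypothetical placeholders.

```python
from itertools import product

# Values copied from the quoted hyperparameter sets.
S_VALUES = [3]
T_VALUES = [0.7, 0.75, 0.85, 0.95, 1.2, 1.5]
D_VALUES = [2, 3, 6, 8]
GAMMA_VALUES = [0, 0.5, 1, 3]

# Hypothetical labels: the quote only says "both" prompt templates are evaluated.
PROMPT_TEMPLATES = ["template_a", "template_b"]


def enumerate_configs():
    """Return one dict per configuration.

    Assumption: the full Cartesian product of all sets is used.
    """
    return [
        {"s": s, "t": t, "d": d, "gamma": g, "prompt": p}
        for s, t, d, g, p in product(
            S_VALUES, T_VALUES, D_VALUES, GAMMA_VALUES, PROMPT_TEMPLATES
        )
    ]


configs = enumerate_configs()
print(len(configs))  # 1 * 6 * 4 * 4 * 2 = 192 under the full-product assumption
```

If only a subset of combinations is actually launched, the count would be smaller; the repository linked in the Open Source Code row is the place to check.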