Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

Authors: Zhengren Wang, Rui Ling, Chufan Wang, Yongan Yu, Sizhe Wang, Zhiyu Li, Feiyu Xiong, Wentao Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves dynamic maintainability metrics by more than 60% with even higher correctness of initial codes. Furthermore, while static metrics fail to accurately reflect maintainability and even contradict each other, our proposed dynamic metrics exhibit high consistency. Our work not only provides the foundation for maintainable code generation, but also highlights the need for more realistic and comprehensive code generation research. Resources: https://github.com/IAAR-Shanghai/MaintainCoder.
Researcher Affiliation	Academia	Zhengren Wang (1,3), Rui Ling (1), Chufan Wang (1), Yongan Yu (2), Sizhe Wang (1), Zhiyu Li (3), Feiyu Xiong (3), Wentao Zhang (1). (1) Center for Data Science, Peking University; (2) McGill University; (3) Center for LLM, Institute for Advanced Algorithms Research, Shanghai. EMAIL, EMAIL, EMAIL
Pseudocode	No	The paper describes the architecture and roles of various agents (e.g., Requirements Analysis Agent, Design Pattern Selection Agent) and provides detailed prompts for these agents in Appendix A.2, but it does not include any structured pseudocode or algorithm blocks in the main body or appendices.
Open Source Code	Yes	Resources: https://github.com/IAAR-Shanghai/MaintainCoder.
Open Datasets	Yes	We introduce MaintainBench, the first benchmark assessing code maintainability through requirement evolution cycles. Constructed through systematic extension of well-established benchmarks (HumanEval [7], MBPP [28], APPS [14], CodeContest [23], xCodeEval [22]), it incorporates diverse requirement changes with expert-curated test cases.
Dataset Splits	Yes	MaintainBench consists of five carefully curated datasets: HumanEval-Dyn, MBPP-Dyn, APPS-Dyn, CodeContests-Dyn, and xCodeEval-Dyn, comprising over 500 Python programming data of diverse difficulty levels. Each extends established benchmarks with systematic requirement changes. ... Entry Level: For entry-level difficulty, we select problems from the HumanEval and MBPP datasets, which are designed to be solvable by newbie programmers. ... We sample 30 problems from each dataset randomly, and extend them to 120 new problems by systematically modifying their requirements. ... Mixture Level: ... We start with a random subset of 50 problems and expand them to over 200 new problems. ... Competition Level: ... We select 30 high-difficulty problems from each dataset and extend them to over 120 new problems. ... In phase I, we generate initial code C_0 for the original problem P_0 using MaintainCoder or baseline methods. ... In phase II, we keep the fixed generator, e.g. GPT-4o-mini, to dynamically probe the maintainability of C_0. Specifically, we generate modified codes reflecting different types of changes: C_0 → {C_ext, C_int, C_dst, C_err}. These variants were then evaluated to compute the dynamic maintainability metrics.
Hardware Specification	No	The paper mentions running experiments with various LLM backbones like GPT-4o-mini, DeepSeek-V3, Claude-3.5-Sonnet, etc. and refers to a 'High-performance Computing Platform of Peking University' in the acknowledgements. However, it does not provide specific details such as GPU models, CPU specifications, or memory amounts used for their experiments.
Software Dependencies	No	The paper mentions using a 'Python interpreter', 'Python library difflib', and 'AutoGen framework [39]'. However, specific version numbers for these software components are not provided.
Experiment Setup	Yes	For all experiments, we set the generation temperature to 0.3 and top_p to 0.95. For MaintainCoder, the maximum number of framework evaluation is set to 3 and the maximum number of code optimization is set to 5. We test the static metrics, including maintainability index (MI) and cyclomatic complexity (CC). ... In Phase II, given a predefined Phase II generator (GPT-4o-mini is adopted in the main experiment), the modification operation is performed as probe to calculate dynamic metrics, including Pass@5, abstract syntax tree structure similarity (AST_sim), and code similarity Code_diff. The dynamic metrics like AST_sim and Code_diff are averaged over five rounds. Both AST_sim and Code_diff are directly calculated by calling the Python library difflib.
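
Illustration (not from the paper): the Experiment Setup row says only that AST_sim and Code_diff are computed by calling the Python library difflib. The sketch below shows one way such similarity scores could be obtained. The function names code_similarity and ast_similarity, the line-level comparison, and the use of AST node-type sequences are all assumptions made for illustration; the authors' exact inputs and formulas may differ.

```python
import ast
import difflib


def code_similarity(old_code: str, new_code: str) -> float:
    """Hypothetical Code_diff-style score: line-level similarity in [0, 1]."""
    matcher = difflib.SequenceMatcher(
        None, old_code.splitlines(), new_code.splitlines()
    )
    return matcher.ratio()


def ast_similarity(old_code: str, new_code: str) -> float:
    """Hypothetical AST_sim-style score: similarity of AST node-type sequences."""
    # Assumption: each program is summarised by the class names of its AST
    # nodes, in the order produced by ast.walk, ignoring identifiers and values.
    old_nodes = [type(node).__name__ for node in ast.walk(ast.parse(old_code))]
    new_nodes = [type(node).__name__ for node in ast.walk(ast.parse(new_code))]
    return difflib.SequenceMatcher(None, old_nodes, new_nodes).ratio()


if __name__ == "__main__":
    original = "def area(w, h):\n    return w * h\n"
    modified = "def area(w, h, scale=1):\n    return w * h * scale\n"
    print("code similarity:", round(code_similarity(original, modified), 3))
    print("AST similarity:", round(ast_similarity(original, modified), 3))
```

Under this reading, a score near 1 means the modified code stayed close to the initial code, so less rewriting was needed to absorb the requirement change.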