Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Co-PatcheR: Collaborative Software Patching with Component-specific Small Reasoning Models

Authors: Yuheng Tang, Hongwei Li, Kaijie Zhu, Michael Yang, Yangruibo Ding, Wenbo Guo

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive evaluation, we show that Co-PatcheR achieves 46% resolved rate on SWE-bench-Verified with only 3 × 14B models. We conduct a comprehensive ablation study to validate our recipes, as well as our choice of training data number, model size, and testing-phase scaling strategy.
Researcher Affiliation | Academia | Yuheng Tang¹, Hongwei Li¹, Kaijie Zhu¹, Michael Yang¹, Yangruibo Ding², Wenbo Guo¹; ¹University of California, Santa Barbara; ²University of California, Los Angeles; EMAIL EMAIL
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Figure 1 shows a workflow diagram, but it is not pseudocode.
Open Source Code | Yes | Co-PatcheR ranks among the top-10 open-source systems on SWE-bench-Verified, outperforming all patchers with open-source models. (Abstract) Justification for Question 5 in NeurIPS Paper Checklist: We will open source the model and code as we said in abstract.
Open Datasets | Yes | We select training issues and the corresponding codebases from the SWE-bench training set and SWE-Gym dataset, which contains different repositories from our testing set. (Section 3.2.1) [22] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023. [36] Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. arXiv preprint arXiv:2412.21139, 2024.
Dataset Splits | Yes | As introduced in Section 3.2.1, we select 2K training issues from the SWE-bench training set and the SWE-Gym [36] dataset and conduct filtering to avoid data leakage. (Section 4.1) We evaluate our system (Co-PatcheR) on the SWE-bench-Verified dataset. (Section 4.1)
Hardware Specification | Yes | We log four kinds of metrics for an average of every issue: end-to-end wall-clock time for one instance under our testing setup (5 root causes from the localization model and 4 PoCs from the validation model), tokens generated across all model calls (completion), latency for average one-shot model inference, and peak GPU memory. These metrics provide the efficiency baseline for Co-PatcheR on two NVIDIA L40S (48 GB) cards with a VLLM deployment framework. (Appendix A)
Software Dependencies | No | The paper mentions using the Qwen-2.5-Coder-14B model and a VLLM deployment framework but does not provide specific version numbers for these or any other software libraries or programming languages used.
Experiment Setup | Yes | Table 2: Training hyper-parameters for the Qwen-2.5-Coder-14B model (Hyperparameter: Value). Peak learning rate: 1 × 10^-5; Warmup ratio: 0.10 of total steps; LR scheduler: Cosine decay (with 10% linear warmup); Batch size (per GPU): 1 (effective batch size = 1 × 12 accumulations); Weight decay: 1 × 10^-5; Number of training epochs: 3.0; Maximum sequence length: 32768 tokens
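
The schedule in the Experiment Setup row (10% linear warmup followed by cosine decay, with 12 gradient-accumulation steps) can be illustrated with a minimal sketch. This is not the authors' training code. It assumes the garbled peak learning rate "1 10 5" reads as 1 × 10^-5, that the rate decays to zero, and that warmup counts optimizer steps; the function names, `min_lr`, and `num_gpus` are hypothetical and do not appear in the paper.

```python
import math

# Values taken from Table 2 as quoted in the Experiment Setup row.
PEAK_LR = 1e-5          # assumption: "1 10 5" is read as 1 x 10^-5
WARMUP_RATIO = 0.10     # 10% of total steps
PER_GPU_BATCH = 1
GRAD_ACCUM_STEPS = 12


def lr_at_step(step, total_steps, peak_lr=PEAK_LR,
               warmup_ratio=WARMUP_RATIO, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup up to peak_lr, then cosine decay toward min_lr.
    min_lr=0.0 is an assumption; the source does not state a floor.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_gpu_batch=PER_GPU_BATCH,
                         accum_steps=GRAD_ACCUM_STEPS, num_gpus=1):
    """Sequences contributing to one optimizer update.

    num_gpus is hypothetical: the number of training GPUs is not quoted above.
    """
    return per_gpu_batch * accum_steps * num_gpus


if __name__ == "__main__":
    total = 1000  # illustrative step count, not from the paper
    for s in (0, 99, 100, 550, 999):
        print(s, lr_at_step(s, total))
    print("effective batch size:", effective_batch_size())
```

With a per-GPU batch of 1 and 12 accumulation steps, each optimizer update sees 12 sequences per GPU, which matches the "1 × 12 accumulations" entry in the table.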