Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Authors: Yifu Guo, Jiaye Lin, Huacan Wang, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified (61.2% with Claude-3.7-Sonnet, 80.0% with Claude-4-Sonnet).
Researcher Affiliation	Academia	Yifu Guo (1,2), Jiaye Lin (3), Huacan Wang (4), Yuzhen Han (5), Sen Hu (6), Ziyi Ni (4,7), Licheng Wang (4), Mingguang Chen (8). Affiliations: (1) Sun Yat-sen University, (2) StepFun, (3) Tsinghua University, (4) University of Chinese Academy of Sciences, (5) University of Toronto, (6) Peking University, (7) Institute of Automation, Chinese Academy of Sciences, (8) University of California, Riverside
Pseudocode	No	The paper describes the operations (Revision, Recombination, Refinement) conceptually and provides detailed prompt structures for these operations in the Appendix, but it does not include formally labeled 'Pseudocode' or 'Algorithm' blocks for the overall framework or its components.
Open Source Code	Yes	JARVIS-Xs/SE-Agent. Justification: The code to reproduce the experimental results is provided via an anonymous link.
Open Datasets	Yes	Benchmark: In our experiments, we utilize SWE-bench Verified, a curated subset of the broader SWE-bench, consisting of 500 real-world GitHub issues. This benchmark is meticulously designed to provide a self-contained and controlled environment for evaluating framework performance, with a specific focus on functional bug fixes. Each instance in the benchmark includes a natural language description of a GitHub issue and its corresponding code repository, serving as the sole input to the model. To ensure the rigor of evaluation, developer-written unit tests are employed to verify the correctness of model-generated patches. This combination of real-world scenarios and systematic validation establishes SWE-bench Verified as a robust and consistent benchmark for assessing the effectiveness of automated bug-fixing systems.
Dataset Splits	Yes	In our experiments, we utilize SWE-bench Verified, a curated subset of the broader SWE-bench, consisting of 500 real-world GitHub issues. This benchmark is meticulously designed to provide a self-contained and controlled environment for evaluating framework performance, with a specific focus on functional bug fixes. Each instance in the benchmark includes a natural language description of a GitHub issue and its corresponding code repository, serving as the sole input to the model.
Hardware Specification	Yes	For deployment, we run all open-source models locally, including DeepSeek-V3-0324, Qwen-2.5-72B-Instruct, and LLaMA-3.1-70B-Instruct, using NVIDIA A100 GPUs with 80GB of memory.
Software Dependencies	No	The paper lists specific LLM models used (DeepSeek-V3-0324, Qwen-2.5-72b-Instruct, Llama-3.1-70b-Instruct, GPT-4o, Claude-3.7-Sonnet) but does not provide version numbers for general ancillary software like Python, PyTorch, or other libraries/frameworks used for implementing the SE-Agent framework itself.
Experiment Setup	Yes	To ensure a fair comparison, we adopt identical prompt formats across all models evaluated in this paper. In our proposed SE-Agent framework, we set the number of candidate trajectories to 10 by default, striking a balance between exploration diversity and computational efficiency.
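
The Research Type row quotes a "relative improvement" figure. As a minimal sketch of how such a figure is conventionally computed, the snippet below divides the gain by the baseline. The function name and both resolve rates are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's exact calculation may differ.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical resolve rates in percent (illustrative only, NOT from the paper).
baseline_rate = 40.0
improved_rate = 62.0

gain = relative_improvement(baseline_rate, improved_rate)
print(f"{gain:.0%} relative improvement")
```

A relative figure is distinct from the absolute difference in percentage points: in this made-up case the absolute gain is 22 points, while the relative gain is 22 / 40.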