Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Can Agents Fix Agent Issues?

Authors: Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, Yiling Lou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further evaluate multiple state-of-the-art SE agents (i.e., Agentless [52], AutoCodeRover [66], and SWE-agent [58]) with both GPT-4o [1] and Claude-3.5-Sonnet [15] on AGENTISSUE-BENCH. We find that all of the existing SE agents exhibit limited capabilities in resolving agent issues. For instance, only 0.67% to 4.67% of agent issues are correctly resolved, which is significantly lower than the resolution rates achieved when these SE agents are applied to traditional software (e.g., 23.20% to 50.80% resolution rate [33]).
Researcher Affiliation | Academia | Alfin Wijaya Rahardja (Fudan University); Junwei Liu (Fudan University); Weitong Chen (Fudan University); Zhenpeng Chen (Nanyang Technological University); Yiling Lou (University of Illinois Urbana-Champaign)
Pseudocode | No | The paper describes methodologies and processes (e.g., issue reproduction procedure in Section 4.1) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code are available at https://github.com/alfin06/AgentIssue-Bench.
Open Datasets | Yes | We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). Data and code are available at https://github.com/alfin06/AgentIssue-Bench.
Dataset Splits | Yes | We randomly separate our collected 201 agent issues into (i) 171 issues (85%) for building the taxonomy and (ii) 30 issues (15%) for evaluating our constructed taxonomy.
Hardware Specification | No | Our experiments rely solely on online LLMs and thus do not impose strict requirements on computational resources.
Software Dependencies | No | We directly adopt their released implementation with the original hyperparameter settings. Backbone LLMs. Based on the recent SWE leaderboard [33], state-of-the-art SE agents achieve higher fixing rate on general software issues when equipped with backbone LLMs GPT-4o [1] and Claude-3.5-Sonnet [15]. These references are to commercial LLM APIs, not specific versioned software dependencies for running the experiments locally.
Experiment Setup | Yes | We directly adopt their released implementation with the original hyperparameter settings. Backbone LLMs. ...we mainly study how effective SE agents are in resolving agent issues with these two backbone LLMs (temperature = 0). To eliminate the randomness from LLMs, we repeat all experiments three times and present the average results.
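
The Dataset Splits and Experiment Setup entries describe two simple procedures: a random 85%/15% split of the 201 collected issues, and averaging results over three repeated runs. The Python sketch below illustrates both under stated assumptions. It is not the authors' code: the function names, the fixed seed, and the per-run counts in the usage lines are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def split_issues(issue_ids, eval_fraction=0.15, seed=0):
    """Randomly split issues into (taxonomy-building, evaluation) lists.

    The seed and the rounding rule are assumptions; the source only states
    that the split is random and yields 171 / 30 issues.
    """
    rng = random.Random(seed)
    shuffled = list(issue_ids)
    rng.shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def average_resolution_rate(resolved_per_run, n_tasks):
    """Mean percentage of resolved tasks across repeated runs."""
    return mean(100.0 * resolved / n_tasks for resolved in resolved_per_run)


# 201 collected issues, as stated in the Dataset Splits entry.
build, evaluate = split_issues(range(201))
print(len(build), len(evaluate))

# Hypothetical resolved counts for three runs over the 50 benchmark tasks;
# these numbers are illustrative and are not results from the paper.
print(average_resolution_rate([2, 3, 1], n_tasks=50))
```

Reporting the mean over repeated runs smooths out run-to-run variation that can remain even with temperature set to 0.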