SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Authors: John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.
Researcher Affiliation | Academia | John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press; Princeton Language and Intelligence, Princeton University. Equal contribution. Correspondence to johnby@stanford.edu, carlosej@princeton.edu.
Pseudocode | Yes | Figure 13: The skeleton code for defining a command that can be accessed in the SWE-agent ACI. The function definition includes both the underlying implementation along with several arguments that describe how to use the command, which is compiled into the System template's command documentation at run time. (A hedged sketch of such a command definition appears after this table.)
Open Source Code | Yes | Second, we build, evaluate, and open-source SWE-agent, a system that provides LMs an ACI for solving real-world software engineering tasks. ... Data, code, and leaderboard at swe-agent.com
Open Datasets | Yes | We primarily evaluate on the SWE-bench dataset, which includes 2,294 task instances from 12 different repositories of popular Python packages [20]. ... We also test SWE-agent's basic code editing abilities with HumanEvalFix, a short-form code debugging benchmark [32]. ... Table 12: Information about each of the datasets that we use for evaluation: SWE-bench [20] and HumanEvalFix [32]. Both datasets have been released under permissive software licenses that allow for evaluation use, and can be used in proprietary systems.
Dataset Splits | Yes | We performed a hyperparameter sweep using a subset of 37 instances sampled randomly from the dev split of SWE-bench.
Hardware Specification | No | All results, ablations, and analyses are based on two leading LMs, GPT-4 Turbo (gpt-4-1106-preview) [34] and Claude 3 Opus (claude-3-opus-20240229) [6]. ... GPT-4 Turbo and Claude 3 Opus have 128k and 200k token context windows, respectively, which provides sufficient room for the LM to interact for several turns after being fed the system prompt, issue description, and optionally, a demonstration.
Software Dependencies | No | Built atop the Linux shell, SWE-agent also allows access to common Linux commands and utilities when needed. ... we integrate a code linter into the edit function to alert the agent of mistakes it may have introduced when editing a file. ... flake8 --isolated --select=F821,F822,F831,E111,E112,E113,E999,E902 "$CURRENT_FILE" 2>&1 (A sketch of how this linter check could gate edits appears after this table.)
Experiment Setup | Yes | For the remaining hyperparameter choices, we performed a sweep over the window size, history processing, and decoding temperature, shown in B.1. ... Table 5: Hyperparameter sweep results on a subset of the SWE-bench dev split. % Resolved shows the mean score across 5 samples. (Example row: Model GPT-4 Turbo, Temperature 0.0, Window 100, History Last 5 Obs., % Resolved 15.1.) (An illustrative sweep script appears after this table.)
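
The Pseudocode row above cites Figure 13, the skeleton code for defining an ACI command. The bash sketch below mirrors that pattern: a documentation comment block describing the command's signature and arguments, followed by the shell function that implements it. The @yaml keys, the goto example, and the _print_window helper are assumptions made for this sketch, not code taken from the paper.

    # @yaml
    # signature: goto <line_number>
    # docstring: moves the file window to be centered on <line_number>
    # arguments:
    #   line_number:
    #     type: integer
    #     description: the line number to move the window to
    #     required: true
    goto() {
        if [ "$#" -ne 1 ]; then
            echo "Usage: goto <line_number>"
            return 1
        fi
        # Record the new position and re-render the viewer window;
        # _print_window is a hypothetical helper standing in for the real display logic.
        export CURRENT_LINE="$1"
        _print_window "$CURRENT_FILE" "$CURRENT_LINE"
    }

Per the quoted caption, the documentation block is what gets compiled into the System template's command documentation at run time, so the agent sees the same usage information that the command author wrote.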
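The Software Dependencies row quotes the flake8 invocation that SWE-agent's edit function uses to lint the current file. The sketch below shows one way such a guard could be wired up; it reuses the quoted flake8 command verbatim, but apply_edit, write_replacement_text, and the rollback logic are illustrative assumptions rather than the paper's implementation.

    lint_current_file() {
        # Exactly the linter invocation quoted above: syntax errors and undefined names only.
        flake8 --isolated --select=F821,F822,F831,E111,E112,E113,E999,E902 "$CURRENT_FILE" 2>&1
    }

    apply_edit() {
        # Keep a backup so a bad edit can be rolled back.
        local backup
        backup="$(mktemp)"
        cp "$CURRENT_FILE" "$backup"

        write_replacement_text "$@"   # hypothetical helper that writes the new text into $CURRENT_FILE

        local errors
        errors="$(lint_current_file)"
        if [ -n "$errors" ]; then
            # Revert the file and surface the linter output so the agent can fix its edit and retry.
            cp "$backup" "$CURRENT_FILE"
            echo "Your edit introduced the following errors and was not applied:"
            echo "$errors"
            return 1
        fi
        echo "Edit applied successfully."
    }

The design point from the paper is simply that linter feedback is returned to the agent at edit time, before a bad change is committed to the file.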
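The Experiment Setup row describes a sweep over window size, history processing, and decoding temperature on 37 randomly sampled dev-split instances. A sweep of that shape could be scripted along the following lines; the run_agent.py entry point, flag names, and grid values are placeholders for illustration, not the authors' tooling (only temperature 0.0, window 100, and "Last 5 Obs." history appear in the quoted result).

    # Placeholder grid values; the paper's actual sweep ranges are not given in the quote above.
    for temperature in 0.0 0.2; do
        for window_size in 50 100 200; do
            for history in full last_5_observations; do
                python run_agent.py \
                    --model gpt-4-1106-preview \
                    --temperature "$temperature" \
                    --window_size "$window_size" \
                    --history_processing "$history" \
                    --split dev \
                    --num_instances 37
            done
        done
    done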