SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik R. Narasimhan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues.
Researcher Affiliation | Academia | Princeton University; Princeton Language and Intelligence; University of Chicago
Pseudocode | No | The paper describes procedures in natural language and provides prompt examples but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Data, code, and leaderboard at swebench.com
Open Datasets | Yes | To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. This dataset comprises a collection of 19,000 non-testing task instances derived from 37 repositories. Data, code, and leaderboard at swebench.com
Dataset Splits | Yes | In addition to the evaluation test set, we also provide a development set for evaluating models and tuning hyperparameters before running on the final test set. Following the style of tables and graphs from before, we present similar statistics to characterize the 225 development task instances (slightly more than 10% of the main evaluation set) collected from 6 open source repositories with licenses permitting such usage. (A loading sketch follows this table.)
Hardware Specification | Yes | SWE-Llama 7b was initialized with CodeLlama-Python 7b and trained in 20 hours on 4 NVIDIA A100s. SWE-Llama 13b was initialized with CodeLlama-Python 13b and trained in 47 hours on 8 NVIDIA A100s.
Software Dependencies | No | The paper mentions software like CodeLlama, DeepSpeed Ulysses, FlashAttention, pytest, tox, and Radon, but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | We finetune using LoRA (Hu et al., 2022) with r = 16, α = 16, dropout = 0.05, on the query, key, value, and output projection matrices of every attention sublayer. We train with a learning rate of 6e-4 and a batch size of 32 sequences per gradient step for a maximum of 4 epochs. We experiment with three different maximum context limits, and simply retrieve as many files as fits within the specified limit. (A configuration sketch follows this table.)
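
The Open Datasets and Dataset Splits rows above describe a 2,294-instance test set plus a 225-instance development set. Below is a minimal loading sketch, assuming the benchmark is published as the Hugging Face dataset `princeton-nlp/SWE-bench` with `dev` and `test` splits; the dataset identifier, split names, and field names are assumptions, since the paper itself only points to swebench.com.

```python
# Sketch: inspecting SWE-bench task instances with the `datasets` library.
# Dataset id, split names, and field names are assumed, not taken from the paper.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench")  # assumed Hugging Face id

dev = swe_bench["dev"]    # ~225 development instances for tuning
test = swe_bench["test"]  # 2,294 evaluation instances

example = test[0]
# Each instance pairs a GitHub issue description with the repository
# snapshot a model must edit, plus the tests used to judge the fix.
print(example["repo"], example["instance_id"])
print(example["problem_statement"][:200])
```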
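
The Experiment Setup row quotes the SWE-Llama fine-tuning recipe, but the paper does not name a fine-tuning library. The sketch below expresses those hyperparameters with the Hugging Face `peft` and `transformers` packages; the choice of libraries, the projection-module names, and the per-device batch arithmetic are assumptions layered on top of the reported r, α, dropout, learning rate, batch size, and epoch count.

```python
# Sketch: the quoted LoRA hyperparameters as a peft/transformers configuration.
# Libraries and module names are assumptions; only the numeric values come
# from the paper.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                  # reported rank
    lora_alpha=16,         # reported alpha
    lora_dropout=0.05,     # reported dropout
    # Query, key, value, and output projections of every attention sublayer;
    # these are the usual Llama-style module names, assumed here.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="swe-llama-lora",       # hypothetical output path
    learning_rate=6e-4,                # reported learning rate
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # with 4 GPUs (as for SWE-Llama 7b),
                                       # 1 x 8 x 4 = 32 sequences per gradient step
    num_train_epochs=4,                # reported maximum of 4 epochs
)
```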