SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik R. Narasimhan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues.
Researcher Affiliation | Academia | Princeton University; Princeton Language and Intelligence; University of Chicago
Pseudocode | No | The paper describes procedures in natural language and provides prompt examples but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Data, code, and leaderboard at swebench.com
Open Datasets | Yes | To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. This dataset comprises a collection of 19,000 non-testing task instances derived from 37 repositories. Data, code, and leaderboard at swebench.com
Dataset Splits | Yes | In addition to the evaluation test set, we also provide a development set for evaluating models and tuning hyperparameters before running on the final test set. Following the style of tables and graphs from before, we present similar statistics to characterize the 225 development task instances (slightly more than 10% of the main evaluation set) collected from 6 open source repositories with licenses permitting such usage. (A loading sketch follows this table.)
Hardware Specification | Yes | SWE-Llama 7b was initialized with CodeLlama-Python 7b and trained in 20 hours on 4 NVIDIA A100s. SWE-Llama 13b was initialized with CodeLlama-Python 13b and trained in 47 hours on 8 NVIDIA A100s.
Software Dependencies | No | The paper mentions software like CodeLlama, DeepSpeed Ulysses, FlashAttention, pytest, tox, and Radon, but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | We finetune using LoRA (Hu et al., 2022) with r = 16, α = 16, dropout = 0.05, on the query, key, value, and output projection matrices of every attention sublayer. We train with a learning rate of 6e-4 and a batch size of 32 sequences per gradient step for a maximum of 4 epochs. We experiment with three different maximum context limits, and simply retrieve as many files as fits within the specified limit. (A configuration sketch follows this table.)
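
The Open Datasets and Dataset Splits rows above describe a 2,294-instance test set plus a 225-instance development set. Below is a minimal loading sketch, assuming the benchmark is published as the Hugging Face dataset `princeton-nlp/SWE-bench` with `dev` and `test` splits; the dataset identifier, split names, and field names are assumptions, since the paper itself only points to swebench.com.

```python
# Sketch: inspecting SWE-bench task instances with the `datasets` library.
# Dataset id, split names, and field names are assumed, not taken from the paper.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench")  # assumed Hugging Face id

dev = swe_bench["dev"]    # ~225 development instances for tuning
test = swe_bench["test"]  # 2,294 evaluation instances

example = test[0]
# Each instance pairs a GitHub issue description with the repository
# snapshot a model must edit, plus the tests used to judge the fix.
print(example["repo"], example["instance_id"])
print(example["problem_statement"][:200])
```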
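
The Experiment Setup row quotes the SWE-Llama fine-tuning recipe, but the paper does not name a fine-tuning library. The sketch below expresses those hyperparameters with the Hugging Face `peft` and `transformers` packages; the choice of libraries, the projection-module names, and the per-device batch arithmetic are assumptions layered on top of the reported r, α, dropout, learning rate, batch size, and epoch count.

```python
# Sketch: the quoted LoRA hyperparameters as a peft/transformers configuration.
# Libraries and module names are assumptions; only the numeric values come
# from the paper.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                  # reported rank
    lora_alpha=16,         # reported alpha
    lora_dropout=0.05,     # reported dropout
    # Query, key, value, and output projections of every attention sublayer;
    # these are the usual Llama-style module names, assumed here.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="swe-llama-lora",       # hypothetical output path
    learning_rate=6e-4,                # reported learning rate
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # with 4 GPUs (as for SWE-Llama 7b),
                                       # 1 x 8 x 4 = 32 sequences per gradient step
    num_train_epochs=4,                # reported maximum of 4 epochs
)
```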