SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

Authors: Niels Mündler, Mark Müller, Jingxuan He, Martin Vechev

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below lists the reproducibility variable, its result, and the supporting LLM response.

Research Type: Experimental
LLM Response: We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed specifically for test generation. Further, as test generation is a similar but more structured task than code generation, it allows for a more fine-grained analysis using issue reproduction rate and coverage changes, providing a dual metric for analyzing systems designed for code repair. Finally, we find that generated tests are an effective filter for proposed code fixes, doubling the precision of SWE-AGENT. We release all data and code at github.com/logic-star-ai/SWT-Bench.

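The final claim in the response above, that generated tests are an effective filter for proposed code fixes, corresponds to a simple fail-to-pass acceptance rule: keep a candidate patch only if a generated test fails on the original code and passes once the patch is applied. The sketch below illustrates that rule; the function, variable, and test names are hypothetical and not taken from the SWT-Bench implementation.

```python
# Illustrative sketch of filtering candidate code fixes with generated tests.
# Names and data structures are hypothetical, not the SWT-Bench codebase.

def accept_patch(results_before: dict[str, bool],
                 results_after: dict[str, bool]) -> bool:
    """Keep a candidate patch only if at least one generated test fails on
    the original code and passes after the patch, and no test that passed
    before the patch regresses."""
    fail_to_pass = any(
        not passed and results_after.get(name, False)
        for name, passed in results_before.items()
    )
    no_regression = all(
        results_after.get(name, False)
        for name, passed in results_before.items()
        if passed
    )
    return fail_to_pass and no_regression


# Example: the generated reproduction test fails pre-patch and passes post-patch.
before = {"test_issue_reproduction": False, "test_existing_behavior": True}
after = {"test_issue_reproduction": True, "test_existing_behavior": True}
assert accept_patch(before, after)
```
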
Researcher Affiliation: Collaboration
LLM Response: Niels Mündler (1), Mark Niklas Müller (1, 2), Jingxuan He (1), Martin Vechev (1); (1) Department of Computer Science, ETH Zurich; (2) LogicStar.ai. Contact: {niels.muendler, mark.mueller, jingxuan.he, martin.vechev}@inf.ethz.ch, mark@logicstar.ai.

Pseudocode: No
LLM Response: The paper includes structured representations of a custom diff format in Figure 7 and Appendix A, but these are specifications of a format rather than pseudocode or an algorithm block explicitly labeled as such.

Open Source Code: Yes
LLM Response: We release all data and code at github.com/logic-star-ai/SWT-Bench.

Open Datasets: Yes
LLM Response: To construct SWT-BENCH, we leverage the same underlying data as SWE-BENCH (Jimenez et al., 2023).

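Since SWT-BENCH is built on the same underlying task instances as SWE-BENCH, that data can be inspected directly. The sketch below assumes the Hugging Face `datasets` library and the published `princeton-nlp/SWE-bench` dataset identifier and field names; SWT-Bench's own data is distributed through the GitHub repository linked above.

```python
# Sketch: load the SWE-bench task instances that SWT-Bench builds on.
# Assumes the Hugging Face `datasets` library and the public dataset id
# "princeton-nlp/SWE-bench"; field names follow the published SWE-bench schema.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swe_bench))                      # number of task instances
example = swe_bench[0]
print(example["repo"])                     # GitHub repository of the issue
print(example["problem_statement"][:200])  # excerpt of the issue text
```
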
Dataset Splits: No
LLM Response: The paper mentions evaluating on SWT-BENCH-LITE, a subset of 276 issues, but does not provide specific train/validation/test splits for its own experimental setup or for use of the benchmark beyond what is stated about its construction.

Hardware Specification: No
LLM Response: The paper does not specify the exact hardware used for running its experiments (e.g., specific GPU models, CPU types, or memory).

Software Dependencies: No
LLM Response: The paper lists the LLMs used (e.g., GPT-4 (gpt-4-1106-preview), Claude 3.0 Haiku, Mistral Large 2) along with their specific versions or release dates, which are key components of the experimental setup. However, it does not list other software dependencies such as the programming language version (e.g., the Python version) or specific library versions.

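The model versions that are stated can nonetheless be pinned explicitly for reproducibility. In the sketch below, only gpt-4-1106-preview is quoted from the paper; the Claude and Mistral API identifiers are assumptions, and the mapping itself is not part of the released code.

```python
# Sketch: pinning the evaluated model versions for reproducibility.
# Only "gpt-4-1106-preview" is quoted from the paper; the other identifiers
# are assumed API names for "Claude 3.0 Haiku" and "Mistral Large 2".
EVALUATED_MODELS = {
    "gpt-4": "gpt-4-1106-preview",
    "claude-3.0-haiku": "claude-3-haiku-20240307",  # assumed identifier
    "mistral-large-2": "mistral-large-2407",        # assumed identifier
}
```
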
Experiment Setup: Yes
LLM Response: We sample at temperature t = 0 for all zero-shot methods and agents and at t = 0.7 for LIBRO and PASS@5. For SWE-AGENT, AIDER, and AUTOCODEROVER, we use their default settings, restricting the number of API calls to 20, reflection steps to 3, and interaction rounds to 10, respectively. For LIBRO we sample 5 tests.

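These reported settings map directly onto a small configuration sketch. The key names below are illustrative and ours; only the numeric values are taken from the quoted setup.

```python
# Experiment settings restated from the quoted setup; key names are ours.
EXPERIMENT_CONFIG = {
    "temperature_zero_shot_and_agents": 0.0,   # all zero-shot methods and agents
    "temperature_libro_pass_at_5": 0.7,        # LIBRO and PASS@5 sampling
    "swe_agent_max_api_calls": 20,             # SWE-AGENT limit
    "aider_max_reflection_steps": 3,           # AIDER limit
    "autocoderover_max_interaction_rounds": 10,  # AUTOCODEROVER limit
    "libro_num_sampled_tests": 5,              # tests sampled for LIBRO
}
```
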