Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Authors: Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. |
| Researcher Affiliation | Academia | Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang; University of Illinois Urbana-Champaign, Purdue University |
| Pseudocode | Yes | D SECVERIFIER Prompt Templates This section elaborates on the prompt templates used by SECVERIFIER for verifying vulnerability dataset. D.1, D.2, and D.3 provide the prompts for the Builder, Exploiter, and Fixer agents, respectively. For fair comparison in the ablation study in Table 3, we provide an integrated prompt for the single agent in D.4. |
| Open Source Code | Yes | Code https://github.com/SEC-bench/SEC-bench |
| Open Datasets | Yes | Dataset https://hf.co/datasets/SEC-bench/SEC-bench |
| Dataset Splits | Yes | Due to budget constraints, we evaluate the best-performing agent on the full dataset, while a detailed comparison among all agents is conducted using 80 representative instances from SEC-bench. For PoC generation, we provide the vulnerability description, harnesses, and the codebase within a Docker environment. For vulnerability patching, we provide the vulnerability description with call stack information, harnesses, and the codebase within a Docker environment. ... We randomly select 15 instances before and 15 instances after the LLM's knowledge cutoff (KC) date based on CVE reserved dates. |
| Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We describe the compute resources in the experimental setting of the appendix. (Note: The appendix, specifically Section B.3 'Code Agent Configurations', describes software configurations but does not provide specific hardware details such as GPU or CPU models.) |
| Software Dependencies | Yes | To comprehensively measure LLM agent capabilities in security tasks, we select three SOTA code agents: SWE-agent [70], OpenHands [63], and Aider [6]. We also choose three strong representative models: Claude 3.7 Sonnet [9], GPT-4o [41], and o3-mini [44]. ... For SWE-agent (version 1.0.1) and OpenHands (version 0.33.0), we set the temperature to 0.0 for all LLMs. The maximum iterations for these agents are 75. The cost limit for these agents are 1.5 for Claude 3.7 Sonnet and 1.0 for GPT-4o and o3-mini. Aider (version 0.82.0) is also configured with a temperature of 0.0 |
| Experiment Setup | Yes | For SWE-agent (version 1.0.1) and OpenHands (version 0.33.0), we set the temperature to 0.0 for all LLMs. The maximum iterations for these agents are 75. The cost limit for these agents are 1.5 for Claude 3.7 Sonnet and 1.0 for GPT-4o and o3-mini. Aider (version 0.82.0) is also configured with a temperature of 0.0; specific iteration and cost limits are not applicable as it operates differently. All agents execute within the same Docker environment. |
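
The agent settings quoted in the Experiment Setup row can be collected into one structure when planning a rerun. The following is a minimal sketch only: the class, field, and function names are hypothetical and do not correspond to the configuration format of SWE-agent, OpenHands, Aider, or the SEC-bench repository. The numeric values are those quoted in the table; the source does not state a unit for the cost limits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentSettings:
    """Hypothetical container for the settings quoted in the table above."""
    version: str
    temperature: float
    max_iterations: Optional[int]  # None where the source says no limit applies
    cost_limited: bool             # whether the per-model cost limits apply


# Values as quoted in the Experiment Setup row.
AGENTS = {
    "SWE-agent": AgentSettings("1.0.1", 0.0, 75, True),
    "OpenHands": AgentSettings("0.33.0", 0.0, 75, True),
    "Aider": AgentSettings("0.82.0", 0.0, None, False),
}

# Per-model cost limits for the cost-limited agents (unit not given in the source).
COST_LIMITS = {
    "Claude 3.7 Sonnet": 1.5,
    "GPT-4o": 1.0,
    "o3-mini": 1.0,
}


def cost_limit(agent: str, model: str) -> Optional[float]:
    """Return the quoted cost limit for an agent/model pair, or None if not applicable."""
    if not AGENTS[agent].cost_limited:
        return None
    return COST_LIMITS[model]
```

For example, `cost_limit("SWE-agent", "Claude 3.7 Sonnet")` looks up the limit quoted for that pair, while any Aider pair yields `None` because the source states that iteration and cost limits do not apply to it.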