Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing
Authors: Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, Xiangyu Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate REPOAUDIT upon fifteen real-world projects used in existing studies, with an average size of 251 KLoC. It is shown that REPOAUDIT effectively reproduces 21 previously reported bugs and uncovers 19 newly discovered bugs, 14 of which have already been fixed in the latest commit, achieving a precision of 78.43%. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Purdue University, West Lafayette, IN, USA. Correspondence to: Jinyao Guo <EMAIL>, Chengpeng Wang <EMAIL>. |
| Pseudocode | Yes | Figure 4. The prompt template for analyzing individual functions. Figure 5. The prompt template for feasibility validation. |
| Open Source Code | Yes | We have open-sourced REPOAUDIT at https://github.com/PurCL/RepoAudit. |
| Open Datasets | Yes | We first evaluate REPOAUDIT upon fifteen real-world projects used in existing studies... As shown by Table 1, we choose five well-maintained projects for each bug type from the bug reports of previous works, which mostly have thousands of stars on GitHub... Furthermore, we study the 2024 CWE Top 25 Most Dangerous Software Weaknesses... We choose two typical static bug detectors as the representatives of industrial tools, namely Meta INFER (Meta, 2025) and Amazon CODEGURU (Amazon, 2025). |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits, as its evaluation focuses on applying the REPOAUDIT agent to entire code repositories and comparing its bug detection performance against existing bug reports and tools, rather than training a machine learning model on pre-split datasets. |
| Hardware Specification | No | The paper states that REPOAUDIT is powered by specific LLM models (e.g., Claude 3.5 Sonnet, Deepseek R1), but it does not provide details about the underlying hardware (GPU/CPU models, memory) on which these LLMs or the REPOAUDIT framework itself were run. It mentions computational resources from the Center for AI Safety but without specific hardware specifications. |
| Software Dependencies | No | The paper mentions 'tree-sitter parsing library' but does not specify its version number. It lists LLM models like 'Claude 3.5 Sonnet' which are services rather than installable software with explicit version numbers for dependencies. For comparison, it mentions 'Meta INFER (Meta, 2025)' and 'Amazon CODEGURU (Amazon, 2025)', but these are external tools, not direct dependencies of REPOAUDIT with version specifications. |
| Experiment Setup | Yes | We set the temperature to 0.0 to reduce the randomness. Similar to existing code auditing works (Heo et al., 2017), we introduce an upper bound K on the calling context and set it to 4, i.e., REPOAUDIT investigates data-flow facts across a maximum of four functions. In addition, we assess REPOAUDIT powered by Claude 3.5 Sonnet under the temperatures of 0.25, 0.5, 0.75, and 1.0. |
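The experiment setup row above describes two concrete knobs: a temperature of 0.0 to reduce randomness, and an upper bound K = 4 on the calling context, meaning data-flow facts are traced across at most four functions. A minimal sketch of such a configuration is shown below; the class and function names (`AuditConfig`, `within_context_bound`) are hypothetical and not taken from the RepoAudit codebase.

```python
from dataclasses import dataclass


@dataclass
class AuditConfig:
    """Hypothetical container for the two settings quoted from the paper."""
    temperature: float = 0.0       # 0.0 reduces randomness in LLM responses
    max_calling_context: int = 4   # upper bound K on cross-function tracing


def within_context_bound(call_chain: list[str], config: AuditConfig) -> bool:
    """Return True if the call chain still fits within the bound K,
    i.e., data-flow facts span at most K functions."""
    return len(call_chain) <= config.max_calling_context


config = AuditConfig()
print(within_context_bound(["parse", "load", "read"], config))        # 3 <= 4: True
print(within_context_bound(["a", "b", "c", "d", "e"], config))        # 5 > 4: False
```

With this bound, an auditing loop would simply stop expanding a call chain once a fifth function would be added, matching the paper's statement that REPOAUDIT "investigates data-flow facts across a maximum of four functions."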