Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing
Authors: Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, Xiangyu Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate REPOAUDIT upon fifteen real-world projects used in existing studies, with an average size of 251 KLoC. It is shown that REPOAUDIT effectively reproduces 21 previously reported bugs and uncovers 19 newly discovered bugs, 14 of which have already been fixed in the latest commit, achieving a precision of 78.43%. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Purdue University, West Lafayette, IN, USA. Correspondence to: Jinyao Guo <EMAIL>, Chengpeng Wang <EMAIL>. |
| Pseudocode | Yes | Figure 4. The prompt template for analyzing individual functions. Figure 5. The prompt template for feasibility validation. |
| Open Source Code | Yes | We have open-sourced REPOAUDIT at https://github.com/PurCL/RepoAudit. |
| Open Datasets | Yes | We first evaluate REPOAUDIT upon fifteen real-world projects used in existing studies... As shown by Table 1, we choose five well-maintained projects for each bug type from the bug reports of previous works, which mostly have thousands of stars on GitHub... Furthermore, we study the 2024 CWE Top 25 Most Dangerous Software Weaknesses... We choose two typical static bug detectors as the representatives of industrial tools, namely Meta INFER (Meta, 2025) and Amazon CODEGURU (Amazon, 2025). |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits, as its evaluation focuses on applying the REPOAUDIT agent to entire code repositories and comparing its bug detection performance against existing bug reports and tools, rather than training a machine learning model on pre-split datasets. |
| Hardware Specification | No | The paper states that REPOAUDIT is powered by specific LLM models (e.g., Claude 3.5 Sonnet, Deepseek R1), but it does not provide details about the underlying hardware (GPU/CPU models, memory) on which these LLMs or the REPOAUDIT framework itself were run. It mentions computational resources from the Center for AI Safety but without specific hardware specifications. |
| Software Dependencies | No | The paper mentions 'tree-sitter parsing library' but does not specify its version number. It lists LLM models like 'Claude 3.5 Sonnet' which are services rather than installable software with explicit version numbers for dependencies. For comparison, it mentions 'Meta INFER (Meta, 2025)' and 'Amazon CODEGURU (Amazon, 2025)', but these are external tools, not direct dependencies of REPOAUDIT with version specifications. |
| Experiment Setup | Yes | We set the temperature to 0.0 to reduce the randomness. Similar to existing code auditing works (Heo et al., 2017), we introduce an upper bound K on the calling context and set it to 4, i.e., REPOAUDIT investigates data-flow facts across a maximum of four functions. In addition, we assess REPOAUDIT powered by Claude 3.5 Sonnet under the temperatures of 0.25, 0.5, 0.75, and 1.0. |
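The experiment setup row above describes two concrete knobs: a temperature of 0.0 to reduce randomness, and an upper bound K = 4 on the calling context, meaning data-flow facts are traced across at most four functions. A minimal sketch of such a configuration is shown below; the class and function names (`AuditConfig`, `within_context_bound`) are hypothetical and not taken from the RepoAudit codebase.

```python
from dataclasses import dataclass


@dataclass
class AuditConfig:
    """Hypothetical container for the two settings quoted from the paper."""
    temperature: float = 0.0       # 0.0 reduces randomness in LLM responses
    max_calling_context: int = 4   # upper bound K on cross-function tracing


def within_context_bound(call_chain: list[str], config: AuditConfig) -> bool:
    """Return True if the call chain still fits within the bound K,
    i.e., data-flow facts span at most K functions."""
    return len(call_chain) <= config.max_calling_context


config = AuditConfig()
print(within_context_bound(["parse", "load", "read"], config))        # 3 <= 4: True
print(within_context_bound(["a", "b", "c", "d", "e"], config))        # 5 > 4: False
```

With this bound, an auditing loop would simply stop expanding a call chain once a fifth function would be added, matching the paper's statement that REPOAUDIT "investigates data-flow facts across a maximum of four functions."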