Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AgentAuditor: Human-level Safety and Security Evaluation for LLM Agents

Authors: Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor-ASSEBench.
Researcher Affiliation | Academia | 1 New York University Abu Dhabi, 2 Nanyang Technological University, 3 KTH Royal Institute of Technology, 4 University of Sydney, 5 University of Illinois Urbana-Champaign, 6 National University of Singapore, 7 Mohamed bin Zayed University of AI
Pseudocode | Yes | Algorithm 1: FINCH-based Representative Selection in AgentAuditor; Algorithm 2: Balanced Allocation Algorithm; Algorithm 3: Automated Classification Pipeline; Algorithm 4: Balanced Screening Algorithm
Open Source Code | Yes | Our work is openly accessible at https://github.com/Astarojth/AgentAuditor-ASSEBench.
Open Datasets | Yes | Moreover, we develop ASSEBench, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing Strict and Lenient judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor-ASSEBench.
Dataset Splits | No | The paper mentions that ASSEBench comprises four subsets (ASSEBench-Security, ASSEBench-Safety, ASSEBench-Strict, and ASSEBench-Lenient) with specific record counts. It also states, 'Notably, since AgentAuditor uses a portion of the test dataset to generate CoT reasoning, this subset is consistently excluded from all evaluations, regardless of whether the model is tested with or without AgentAuditor, to prevent data contamination.' However, it does not provide specific percentages, absolute counts, or detailed methodologies for how these 'portions' or subsets are used as explicit train/test/validation splits for the main experimental evaluations, or for any model training within the paper's methodology (AgentAuditor itself is training-free).
Hardware Specification | Yes | All experiments are performed on an HPC node equipped with 4 Nvidia Tesla A100 80G GPUs, 2 EPYC 64-core CPUs, and 480GB of RAM.
Software Dependencies | No | The paper mentions "All used models are configured to BF16 precision and deployed with the vLLM framework [40]." While it names the vLLM framework, it does not specify a version number for it or any other key software libraries or dependencies, which is required for reproducible software details.
Experiment Setup | Yes | Unless otherwise specified, all experiments use fixed parameters: N = 8, K = 3, and an embedding dimension of 512. To compare AgentAuditor with small-parameter LLMs fine-tuned specifically for safety evaluation tasks, we select ShieldAgent [99] (fine-tuned from Qwen-2.5-7B) and Llama-Guard-3 [34] (fine-tuned from Llama-3.1-8B) as representative examples of this conventional approach. Regarding prompts, all base models utilize the prompt template proposed in R-Judge. ShieldAgent and Llama-Guard-3, however, use their pre-defined prompt templates, as fine-tuned models typically require task-specific prompts to fully demonstrate their performance. For the embedding model used in AgentAuditor, we select Nomic-Embed-Text-v1.5 [63], the most frequently downloaded text embedding model on Ollama [64]. More details about the parameter values used in our experiments are provided in Appendix K.
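
The contamination safeguard quoted in the Dataset Splits row (records used to generate CoT reasoning are excluded from every evaluation, with or without the method under test) can be illustrated with a minimal Python sketch. This is not the authors' code: the Record fields, the function names, and the toy data are all hypothetical, and the paper does not report which records or how many are held out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str  # hypothetical identifier field
    label: int      # hypothetical ground-truth label: 1 = unsafe, 0 = safe


def split_for_evaluation(records, cot_source_ids):
    """Return (cot_pool, eval_pool): two disjoint lists covering all records.

    cot_pool holds the records reserved for generating reasoning examples;
    eval_pool holds everything else and is the only set that gets scored.
    """
    cot_ids = set(cot_source_ids)
    cot_pool = [r for r in records if r.record_id in cot_ids]
    eval_pool = [r for r in records if r.record_id not in cot_ids]
    return cot_pool, eval_pool


def accuracy(predictions, eval_pool):
    """Fraction of eval_pool records whose predicted label matches the truth."""
    if not eval_pool:
        raise ValueError("evaluation pool is empty")
    correct = sum(predictions[r.record_id] == r.label for r in eval_pool)
    return correct / len(eval_pool)


# Toy data (assumption): ten records with alternating labels.
records = [Record(f"r{i}", i % 2) for i in range(10)]

# The same held-out IDs are removed for every evaluated configuration,
# so baseline and augmented runs are scored on an identical eval_pool.
cot_pool, eval_pool = split_for_evaluation(records, ["r0", "r1", "r2"])
```

Scoring every configuration against the same eval_pool is what keeps the comparison fair: a baseline evaluated on the full set would be measured on records the augmented system had already seen.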