Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Authors: Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Yarin Gal, Xander Davies
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm. |
| Researcher Affiliation | Collaboration | Maksym Andriushchenko¹, Alexandra Souly², Mateusz Dziemian¹, Derek Duenas¹, Maxwell Lin¹, Justin Wang¹, Dan Hendrycks¹, Andy Zou¹, Zico Kolter¹, Matt Fredrikson¹, Yarin Gal², Xander Davies². ¹Gray Swan AI, ²UK AI Security Institute. Additional affiliations: EPFL, Center for AI Safety, Carnegie Mellon University, University of Oxford. |
| Pseudocode | Yes | Figure 2: An illustrative example of behavior, tool implementation, and scoring function. Note that most scoring functions and some tool implementations can be much longer due to extra content, such as HTML webpages or blocks of messages. |
| Open Source Code | Yes | To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm. |
| Open Datasets | Yes | To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm. |
| Dataset Splits | Yes | Dataset splits. We split all behaviors into three categories: validation behaviors (10%), public test behaviors (60%), and private test behaviors (30%). |
| Hardware Specification | No | The paper mentions costs for GPT-4o and GPT-4o-mini and refers to 'our own deployment of frontier LLMs on Azure' but does not specify any particular hardware like GPU/CPU models or specific Azure instance types. |
| Software Dependencies | No | The paper mentions various LLM models with their checkpoints (e.g., 'openai/gpt-3.5-turbo-0125') and refers to the 'Inspect' framework, but does not provide specific version numbers for any software libraries, programming languages, or the Inspect framework itself. |
| Experiment Setup | Yes | Evaluation settings. In our main evaluation, we consider a basic agentic setting, i.e., direct prompting with a simple while loop where the result of tool executions is fed into the model's context. By default, we do not use more complex scaffolding or elaborate prompting that may enhance both the capability and robustness of the agent, though we suspect it is possible for specially designed scaffolds to improve performance on AgentHarm. Unless otherwise specified, we use temperature 0 with up to 4096 generated output tokens. We use GPT-4o as a semantic judge for grading particular aspects of the agent's output and for tracking refusals. We use a basic system prompt which provides extra guidance to the LLM agent to encourage successful tool calls and avoid common failure modes like asking the user to perform the action (see Table 4 in Appendix B for the full system prompt). |
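The "simple while loop" setting quoted above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: `call_model`, `TOOLS`, and the message format are hypothetical stand-ins, and the stub model is hard-coded for demonstration.

```python
# Sketch of a basic agentic loop: the model is prompted directly, and each
# tool-execution result is appended to the conversation context until the
# model stops issuing tool calls (or a turn cap is reached).
# `call_model` and `TOOLS` are hypothetical stand-ins, not the paper's code.

TOOLS = {
    "add": lambda a, b: a + b,  # toy tool for illustration
}

def call_model(messages):
    """Hypothetical model stub: issues one tool call, then a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"content": "done"}

def run_agent(task, max_turns=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):  # while-style loop with a safety cap
        reply = call_model(messages)
        if "tool" not in reply:  # no tool call -> final answer
            return reply["content"], messages
        result = TOOLS[reply["tool"]](**reply["args"])
        # Feed the tool-execution result back into the model's context.
        messages.append({"role": "tool", "content": str(result)})
    return None, messages

answer, history = run_agent("compute 2 + 3")
```

In the paper's actual setup this loop would wrap a real LLM call (temperature 0, up to 4096 output tokens) under the Inspect framework, with AgentHarm's synthetic tools in place of the toy `add`.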