Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Authors: Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental When tested against leading LLMs, Co P1 reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times. Figure 2 presents a comparative analysis of Co P s attack success rate against the leading open-source and proprietary LLMs, demonstrating substantial gains over state-of-the-art single-turn jailbreak methods. We conduct our experiments using the Harm Bench dataset [18].
Researcher Affiliation Collaboration Chen Xiong The Chinese University of Hong Kong Sha Tin, Hong Kong EMAIL Pin-Yu Chen IBM Research New York, USA EMAIL Tsung-Yi Ho The Chinese University of Hong Kong Sha Tin, Hong Kong EMAIL
Pseudocode Yes Algorithm 1 Composition-of-Principles (COP) Algorithm
Open Source Code No 1Project Page available at: https://huggingface.co/spaces/Trust Safe AI/Co P/
Open Datasets Yes We conduct our experiments using the Harm Bench dataset [18], which contains 400 malicious queries designed to represent violations of legal standards and social norms.
Dataset Splits Yes We conduct our experiments using the Harm Bench dataset [18], which contains 400 malicious queries designed to represent violations of legal standards and social norms. [...] for the latter two models, we report the results based on 50 randomly sampled queries from Harmbench. [...] The experiment was conducted on the Llama-27B-Chat model using 50 randomly sampled queries from Harmbench, with results evaluated by the Harmbench classifier.
Hardware Specification Yes As majority of experiment in Sec. 4 are conducted under a single A800 GPU with 80GB of memory. However, some of the Target LLMs requires more than one GPU. The maximum usage of running Co P pipeline with 70B Target LLM will be 4 A800 GPU with 80GB, which will be the maximum costs for running the all the experiments.
Software Dependencies No The paper mentions specific models used like Grok-2 and GPT-4, but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes Our main hyperparameter is the Number of Attack Attempts. We set the attack attempts to be 10 for the majority of experiment. We set out attack attempts to be 20 in Sec. 4.2 for all the jailbreak methods for consistency. Additionally, we set the jailbreak threshold to η = 10 and the similarity threshold to τ = 1. Due to better alignment of O1 and Claude 3.5 Sonnet, we set the jailbreak threshold to η >= 7 and keep the similarity threshold the same.