Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Analogy-based Multi-Turn Jailbreak against Large Language Models

Authors: Mengjie Wu, Yihao Huang, Zhenjun Lin, Kangjie Chen, Yuyang Zhang, Yuhan Huang, Run Wang, Lina Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To sum up, our work has the following contributions: Extensive experiments on two public benchmarks and six target commercial or open-source LLMs, and six baselines confirm the effectiveness of AMA. We achieve 96.0% attack success rate (ASR) on AdvBench and 90.7% on JailbreakBench, outperforming five baselines. AMA also maintains over 90% ASR on commercial models like GPT-4o-mini and DeepSeek-R1, while producing more semantically aligned and realistically harmful outputs. Section 5: Experiments. 5.1 Experimental Setup. Datasets. Model. Baselines. Metrics. Implementation details. 5.2 Attack Evaluation. Attack effectiveness. Attack consistency. Harmfulness of response. Against defense methods. Attack efficiency. 5.3 Ablation Study.
Researcher Affiliation | Academia | Mengjie Wu1, Yihao Huang2, Zhenjun Lin1, Kangjie Chen3, Yuyang Zhang1, Yuhan Huang4, Run Wang1, Lina Wang1. 1: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China; 2: National University of Singapore, Singapore; 3: Nanyang Technological University, Singapore; 4: School of Economics and Management, Hubei University of Technology
Pseudocode | No | The paper describes its methodology in Section 4 with text and a framework overview figure (Figure 4), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is released at AMA. WARNING: This paper contains potentially unsafe examples. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Our code is available at the anonymous website https://anonymous.4open.science/r/AMA-E1C4
Open Datasets | Yes | Datasets. Following the previous work, such as PAIR [5], TAP [25], CoA [39], CFA [32], we selected three representative datasets, Advbench Subset [5], Jailbreakbench [4] and StrongREJECT [31].
Dataset Splits | No | The paper mentions using subsets of Advbench (50 questions), Jailbreakbench (100 instructions), and StrongREJECT (221 prompts from the custom category). However, it does not specify how these datasets were further split into training, validation, or test sets for the experimental evaluation of AMA. The paper describes the *selection* of prompts, not *splits* for training or testing the attack method itself. For instance, it says "Advbench Subset [5] is based on 50 representative questions selected from the AdvBench dataset" and "we adopt the 221 prompts from the custom category as our evaluation subset" but does not detail train/validation/test splits.
Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: The paper does not report specific details on compute resources such as GPU type, memory, or runtime. This is because all experiments are conducted via commercially deployed LLM APIs. Compute-related factors such as execution time depend on the specific model API used.
Software Dependencies | Yes | Model. We evaluate the attack performance of our AMA method on 6 popular LLMs, including 3 open-source LLMs: vicuna-1.5-13b [44], llama-3.1-70b [9], qwen-2.5-72b [37], and 3 closed-source commercial LLMs: gpt-3.5-turbo [26], gpt-4o-mini [14] and deepseek-r1 [10]. Implementation details. We adopt the llama-3.1-70b model as the attack model M_attack to generate adversarial prompts and gpt-4o-mini is used as harmfulness evaluator and the judge model JailbreakJudge during the optimization process, while the reasoning-oriented LLM deepseek-r1 is employed to assess attack consistency.
Experiment Setup | Yes | Implementation details. We adopt the llama-3.1-70b model as the attack model M_attack to generate adversarial prompts and gpt-4o-mini is used as harmfulness evaluator and the judge model JailbreakJudge during the optimization process, while the reasoning-oriented LLM deepseek-r1 is employed to assess attack consistency. Following the setting of CoA [39], we use a temperature of 1 and a top-k value of 0.9 for M_attack and temperature of 0 for target LLMs. We set conversation turns T = 3 and maximum attack iterations K = 3 to balance effectiveness and computational cost. For the implementation of baseline methods, we follow their experimental settings, but standardize the number of attack iterations to three and set the maximum output tokens to 4096 to ensure a fair comparison.
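
For readers who want to log the reported settings alongside a reproduction attempt, the values quoted in the Experiment Setup row can be gathered into one record. The following is a minimal sketch, not the authors' code: the class name `ReportedSetup`, its field names, and the `validate` helper are all hypothetical, and only the values themselves come from the quoted text. The 0.9 value is stored under the label the paper uses ("top-k"), without reinterpreting it.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Hypothetical record of the hyperparameters quoted in the table above."""

    attack_model: str = "llama-3.1-70b"        # M_attack
    judge_model: str = "gpt-4o-mini"           # harmfulness evaluator and judge
    consistency_model: str = "deepseek-r1"     # assesses attack consistency
    attack_temperature: float = 1.0
    attack_top_k_as_reported: float = 0.9      # quoted as "a top-k value of 0.9"
    target_temperature: float = 0.0
    conversation_turns: int = 3                # T
    max_attack_iterations: int = 3             # K
    baseline_max_output_tokens: int = 4096     # stated for the baseline methods

    def validate(self) -> None:
        """Basic sanity checks; the bounds are assumptions, not from the paper."""
        if self.attack_temperature < 0 or self.target_temperature < 0:
            raise ValueError("temperatures must be non-negative")
        if self.conversation_turns < 1 or self.max_attack_iterations < 1:
            raise ValueError("T and K must be at least 1")
        if self.baseline_max_output_tokens < 1:
            raise ValueError("max output tokens must be positive")


setup = ReportedSetup()
setup.validate()
for name, value in asdict(setup).items():
    print(f"{name}: {value}")
```

Freezing the dataclass keeps the recorded values immutable, so a run log built from `asdict(setup)` cannot drift from what the paper reports.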