Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multi-Turn Jailbreaking Large Language Models via Attention Shifting
Authors: Xiaohu Du, Fan Mo, Ming Wen, Tu Gu, Huadi Zheng, Hai Jin, Jie Shi
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three LLMs and two datasets show that our approach surpasses existing approaches in jailbreak effectiveness, the stealth of jailbreak prompts, and attack efficiency. Our work emphasizes the importance of enhancing the robustness of LLMs' attention mechanism in multi-turn dialogue scenarios for a better defense strategy. |
| Researcher Affiliation | Collaboration | Xiaohu Du (1,2,3,4), Fan Mo (7), Ming Wen (1,2,3,4,6,*), Tu Gu (7), Huadi Zheng (7), Hai Jin (2,3,5), Jie Shi (7). 1: School of Cyber Science and Engineering, Huazhong University of Science and Technology (HUST); 2: National Engineering Research Center for Big Data Technology and System; 3: Services Computing Technology and System Lab; 4: Hubei Engineering Research Center on Big Data Security and Hubei Key Laboratory of Distributed System Security; 5: Cluster and Grid Computing Lab, School of Computer Science and Technology, HUST; 6: Jin Yin Hu Laboratory; 7: Huawei International |
| Pseudocode | Yes | Algorithm 1: Attention Shifting for Jailbreaking. Require: attack model F_A, target model F_T, judge model F_J, max iterations G, population size N. Input: harmful query X. Output: harmful multi-turn dialog, harmful response |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described. While it references an 'uncensored model' with a HuggingFace link, this is a third-party model used by the authors, not their own implementation code. |
| Open Datasets | Yes | Dataset. In this study, we utilize the Question List (Yu et al. 2023) dataset, which includes 100 queries covering various prohibited scenarios such as illegal activities, unethical practices, discriminatory speech, and toxic content. The second dataset is AdvBench (Zou et al. 2023), which contains 520 instances of harmful behaviors across seven scenarios: Illegal Activity, Hate Speech, Malware, Physical Harm, Economic Harm, Fraud, and Privacy Violence (Ding et al. 2023). |
| Dataset Splits | No | The paper mentions using 100 multi-turn queries from the Question List and 520 instances from AdvBench for evaluation. It notes that 22 out of 100 queries were successful jailbreaks and 78 failed on LLaMA-2. However, it does not specify any training, validation, or test splits for its own experimental setup, nor does it provide percentages or counts for these splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or details about the computing infrastructure used for running the experiments. |
| Software Dependencies | No | The paper lists various LLMs used (e.g., LLaMA-2, LLaMA-3.1, Qwen-2, GPT-3.5, GPT-4o, GPT-2 for PPL calculation) but does not provide specific version numbers for underlying software libraries, frameworks (like PyTorch, TensorFlow), or programming languages (like Python) that would be needed for replication. |
| Experiment Setup | Yes | To balance the effectiveness and efficiency of multi-turn queries, we set the number of turns to 5. To maintain the number of queries to the target model by ASJA within a low range, we set the population size N to 10, the maximum number of iterations G to 5, and both the uniform crossover probability and mutation probability to 0.5. |
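The experiment-setup parameters (population size N = 10, G = 5 iterations, uniform crossover and mutation each with probability 0.5) describe an evolutionary search loop. The sketch below is a minimal, hypothetical illustration of such a loop; it is not the authors' implementation. The candidates are abstract token lists and `fitness` is a stand-in for the judge model F_J, which in the paper scores multi-turn dialogue prompts against the target model.

```python
import random

random.seed(0)

# Parameters reported in the paper's setup (all other details are assumptions):
N, G = 10, 5                      # population size, max iterations
P_CROSSOVER = P_MUTATION = 0.5    # uniform crossover / mutation probabilities
TOKENS = list("abcdefgh")         # stand-in vocabulary, purely illustrative
LENGTH = 6                        # stand-in candidate length

def fitness(candidate):
    """Stand-in for the judge model F_J: here, simply count 'a' tokens."""
    return candidate.count("a")

def uniform_crossover(p1, p2):
    """Uniform crossover: each position drawn from either parent with prob 0.5."""
    return [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]

def mutate(candidate):
    """Resample each position independently with probability P_MUTATION."""
    return [random.choice(TOKENS) if random.random() < P_MUTATION else t
            for t in candidate]

def evolve():
    population = [[random.choice(TOKENS) for _ in range(LENGTH)]
                  for _ in range(N)]
    for _ in range(G):
        # Keep the better half as parents, refill with crossover + mutation.
        population.sort(key=fitness, reverse=True)
        parents = population[: N // 2]
        children = []
        while len(children) < N - len(parents):
            p1, p2 = random.sample(parents, 2)
            child = (uniform_crossover(p1, p2)
                     if random.random() < P_CROSSOVER else list(p1))
            children.append(mutate(child))
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(len(best), fitness(best))
```

The loop structure (rank, select, crossover, mutate) is a generic genetic algorithm; the paper's actual candidates, operators, and judge-model scoring are not reproduced here.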