Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multi-Turn Jailbreaking Large Language Models via Attention Shifting
Authors: Xiaohu Du, Fan Mo, Ming Wen, Tu Gu, Huadi Zheng, Hai Jin, Jie Shi
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three LLMs and two datasets show that our approach surpasses existing approaches in jailbreak effectiveness, the stealth of jailbreak prompts, and attack efficiency. Our work emphasizes the importance of enhancing the robustness of LLMs' attention mechanism in multi-turn dialogue scenarios for a better defense strategy. |
| Researcher Affiliation | Collaboration | Xiaohu Du (1,2,3,4), Fan Mo (7), Ming Wen (1,2,3,4,6,*), Tu Gu (7), Huadi Zheng (7), Hai Jin (2,3,5), Jie Shi (7). 1: School of Cyber Science and Engineering, Huazhong University of Science and Technology (HUST); 2: National Engineering Research Center for Big Data Technology and System; 3: Services Computing Technology and System Lab; 4: Hubei Engineering Research Center on Big Data Security and Hubei Key Laboratory of Distributed System Security; 5: Cluster and Grid Computing Lab, School of Computer Science and Technology, HUST; 6: Jin Yin Hu Laboratory; 7: Huawei International |
| Pseudocode | Yes | Algorithm 1: Attention Shifting for Jailbreaking. Require: attack model F_A, target model F_T, judge model F_J, max iterations G, population size N. Input: harmful query X. Output: harmful multi-turn dialog, harmful response |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described. While it references an 'uncensored model' with a HuggingFace link, this is a third-party model used by the authors, not their own implementation code. |
| Open Datasets | Yes | Dataset. In this study, we utilize the Question List (Yu et al. 2023) dataset, which includes 100 queries covering various prohibited scenarios such as illegal activities, unethical practices, discriminatory speech, and toxic content. The second dataset is AdvBench (Zou et al. 2023), which contains 520 instances of harmful behaviors across seven scenarios: Illegal Activity, Hate Speech, Malware, Physical Harm, Economic Harm, Fraud, and Privacy Violence (Ding et al. 2023). |
| Dataset Splits | No | The paper mentions using 100 multi-turn queries from the Question List and 520 instances from AdvBench for evaluation. It notes that 22 out of 100 queries were successful jailbreaks and 78 failed on LLaMA-2. However, it does not specify any training, validation, or test splits for its own experimental setup, nor does it provide percentages or counts for these splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or details about the computing infrastructure used for running the experiments. |
| Software Dependencies | No | The paper lists various LLMs used (e.g., LLaMA-2, LLaMA-3.1, Qwen-2, GPT-3.5, GPT-4o, GPT-2 for PPL calculation) but does not provide specific version numbers for underlying software libraries, frameworks (like PyTorch, TensorFlow), or programming languages (like Python) that would be needed for replication. |
| Experiment Setup | Yes | To balance the effectiveness and efficiency of multi-turn queries, we set the number of turns to 5. To maintain the number of queries to the target model by ASJA within a low range, we set the population size N to 10, the maximum number of iterations G to 5, and both the uniform crossover probability and mutation probability to 0.5. |
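The experiment-setup parameters (population size N = 10, G = 5 iterations, uniform crossover and mutation each with probability 0.5) describe an evolutionary search loop. The sketch below is a minimal, hypothetical illustration of such a loop; it is not the authors' implementation. The candidates are abstract token lists and `fitness` is a stand-in for the judge model F_J, which in the paper scores multi-turn dialogue prompts against the target model.

```python
import random

random.seed(0)

# Parameters reported in the paper's setup (all other details are assumptions):
N, G = 10, 5                      # population size, max iterations
P_CROSSOVER = P_MUTATION = 0.5    # uniform crossover / mutation probabilities
TOKENS = list("abcdefgh")         # stand-in vocabulary, purely illustrative
LENGTH = 6                        # stand-in candidate length

def fitness(candidate):
    """Stand-in for the judge model F_J: here, simply count 'a' tokens."""
    return candidate.count("a")

def uniform_crossover(p1, p2):
    """Uniform crossover: each position drawn from either parent with prob 0.5."""
    return [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]

def mutate(candidate):
    """Resample each position independently with probability P_MUTATION."""
    return [random.choice(TOKENS) if random.random() < P_MUTATION else t
            for t in candidate]

def evolve():
    population = [[random.choice(TOKENS) for _ in range(LENGTH)]
                  for _ in range(N)]
    for _ in range(G):
        # Keep the better half as parents, refill with crossover + mutation.
        population.sort(key=fitness, reverse=True)
        parents = population[: N // 2]
        children = []
        while len(children) < N - len(parents):
            p1, p2 = random.sample(parents, 2)
            child = (uniform_crossover(p1, p2)
                     if random.random() < P_CROSSOVER else list(p1))
            children.append(mutate(child))
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(len(best), fitness(best))
```

The loop structure (rank, select, crossover, mutate) is a generic genetic algorithm; the paper's actual candidates, operators, and judge-model scoring are not reproduced here.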