AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Authors: Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.
Researcher Affiliation | Academia | Xiaogeng Liu (1), Nan Xu (2), Muhao Chen (3), Chaowei Xiao (1); 1 University of Wisconsin-Madison, 2 USC, 3 University of California, Davis
Pseudocode | Yes | Algorithm 1 Genetic Algorithm, Algorithm 2 AutoDAN-HGA, Algorithm 3 Genetic Algorithm, Algorithm 4 Hierarchical Genetic Algorithm, Algorithm 5 LLM-based Diversification, Algorithm 6 Crossover Function, Algorithm 7 Apply Crossover and Mutation, Algorithm 8 Construct Momentum Word Dictionary, Algorithm 9 Replace Words with Synonyms, Algorithm 10 AutoDAN-GA, Algorithm 11 GPT-Recheck. (A crossover/mutation sketch follows the table.)
Open Source Code | Yes | Code is available at https://github.com/SheltonLiu-N/AutoDAN.
Open Datasets | Yes | Dataset. We use AdvBench Harmful Behaviors introduced by Zou et al. (2023) to evaluate the jailbreak attacks. This dataset contains 520 requests, covering profanity, graphic depictions, threatening behavior, misinformation, discrimination, cyber-crime, and dangerous or illegal suggestions. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper states, 'We use AdvBench Harmful Behaviors introduced by Zou et al. (2023) to evaluate the jailbreak attacks' and 'We conduct these evaluations by generating a jailbreak prompt for each malicious request in the dataset and testing the final responses from the victim LLM.' The dataset is thus used only for evaluation; explicit training, validation, and test splits are not defined for AutoDAN itself.
Hardware Specification | Yes | We calculate the time cost on a single NVIDIA A100 80GB with AMD EPYC 7742 64-Core Processor.
Software Dependencies | Yes | In this paper, we use OpenAI's GPT-4 API (OpenAI, 2023b) to conduct LLM-based diversification. (A diversification sketch follows the table.)
Experiment Setup | Yes | We configure the hyper-parameters of AutoDAN and AutoDAN-HGA as follows: a crossover rate of 0.5, a mutation rate of 0.01, an elite rate of 0.1, and five breakpoints for multi-point crossover. The total number of iterations is fixed at 100. Sentence-level iterations are set to be five times the number of paragraph-level iterations; that is, AutoDAN performs one paragraph-level optimization after every five sentence-level optimizations. (A configuration sketch follows the table.)
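
The pseudocode row above lists a crossover function and a combined crossover-and-mutation step (Algorithms 6 and 7). The Python sketch below is a minimal illustration of these two operators only, assuming prompts are represented as lists of sentences; the function names, the alternating-segment scheme, and the no-op diversification hook are assumptions, not the released implementation.

```python
import random

def multipoint_crossover(parent1, parent2, num_points=5):
    """Illustrative multi-point crossover over two sentence lists.

    Both parents are split at the same randomly chosen breakpoints and the
    children alternate which parent contributes each segment. Parents are
    truncated to the shorter length for simplicity (a sketch-level choice).
    """
    length = min(len(parent1), len(parent2))
    if length < 2:
        return list(parent1), list(parent2)
    points = sorted(random.sample(range(1, length), min(num_points, length - 1)))
    child1, child2 = [], []
    prev, swap = 0, False
    for point in points + [length]:
        seg1, seg2 = parent1[prev:point], parent2[prev:point]
        if swap:
            child1.extend(seg2)
            child2.extend(seg1)
        else:
            child1.extend(seg1)
            child2.extend(seg2)
        swap = not swap
        prev = point
    return child1, child2

def mutate(sentences, mutation_rate=0.01, diversify=lambda s: s):
    """With small probability, rewrite a sentence via a diversification hook
    (a no-op placeholder here; the paper uses LLM-based diversification)."""
    return [diversify(s) if random.random() < mutation_rate else s
            for s in sentences]
```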
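For the AdvBench Harmful Behaviors data, a loading sketch is given below. The file path and the 'goal'/'target' column names assume the CSV layout of the Zou et al. (2023) release; adjust them if the local copy differs.

```python
import csv

# Path and column names assume the harmful_behaviors.csv file from the
# Zou et al. (2023) AdvBench release; adjust if the layout differs.
def load_advbench(path="data/advbench/harmful_behaviors.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Each row pairs a malicious request ("goal") with a target affirmative
    # response prefix ("target") used when scoring jailbreak success.
    return [(r["goal"], r["target"]) for r in rows]

requests = load_advbench()
print(len(requests))  # the paper reports 520 requests
```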
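The LLM-based diversification step (Algorithm 5) calls OpenAI's GPT-4 API to rewrite candidate sentences. The sketch below shows one plausible wiring with the openai Python client; the system-prompt wording, temperature, and function name are assumptions, not the paper's exact prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def diversify(sentence: str, model: str = "gpt-4") -> str:
    """Ask GPT-4 to rephrase one sentence of the jailbreak prompt while
    preserving its meaning (the instruction text here is an assumption)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rewrite the following sentence, preserving its meaning."},
            {"role": "user", "content": sentence},
        ],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()
```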
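The experiment-setup row maps naturally onto a small configuration object, plus a helper that interleaves sentence-level and paragraph-level steps at the reported 5:1 ratio. How the 100-iteration budget is divided between the two levels is not fully specified in the excerpt, so the schedule below is an assumption.

```python
from dataclasses import dataclass

@dataclass
class AutoDANConfig:
    # Hyper-parameters reported in the paper's experiment setup.
    crossover_rate: float = 0.5
    mutation_rate: float = 0.01
    elite_rate: float = 0.1
    num_breakpoints: int = 5          # breakpoints for multi-point crossover
    total_iterations: int = 100
    sentence_steps_per_paragraph_step: int = 5

def iteration_schedule(cfg: AutoDANConfig):
    """Yield which level of the hierarchical GA runs at each step: one
    paragraph-level optimization after every five sentence-level ones
    (the exact interleaving is an assumption)."""
    for step in range(1, cfg.total_iterations + 1):
        if step % (cfg.sentence_steps_per_paragraph_step + 1) == 0:
            yield "paragraph"
        else:
            yield "sentence"

cfg = AutoDANConfig()
print(list(iteration_schedule(cfg))[:7])
# ['sentence', 'sentence', 'sentence', 'sentence', 'sentence', 'paragraph', 'sentence']
```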