AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Authors: Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.
Researcher Affiliation | Academia | Xiaogeng Liu (1), Nan Xu (2), Muhao Chen (3), Chaowei Xiao (1); 1 University of Wisconsin-Madison, 2 USC, 3 University of California, Davis
Pseudocode | Yes | Algorithm 1 Genetic Algorithm, Algorithm 2 AutoDAN-HGA, Algorithm 3 Genetic Algorithm, Algorithm 4 Hierarchical Genetic Algorithm, Algorithm 5 LLM-based Diversification, Algorithm 6 Crossover Function, Algorithm 7 Apply Crossover and Mutation, Algorithm 8 Construct Momentum Word Dictionary, Algorithm 9 Replace Words with Synonyms, Algorithm 10 AutoDAN-GA, Algorithm 11 GPT-Recheck. (A crossover/mutation sketch follows the table.)
Open Source Code | Yes | Code is available at https://github.com/SheltonLiu-N/AutoDAN.
Open Datasets | Yes | Dataset. We use AdvBench Harmful Behaviors introduced by Zou et al. (2023) to evaluate the jailbreak attacks. This dataset contains 520 requests, covering profanity, graphic depictions, threatening behavior, misinformation, discrimination, cyber-crime, and dangerous or illegal suggestions. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper states, 'We use AdvBench Harmful Behaviors introduced by Zou et al. (2023) to evaluate the jailbreak attacks' and 'We conduct these evaluations by generating a jailbreak prompt for each malicious request in the dataset and testing the final responses from the victim LLM.' The dataset is thus used only for evaluation; explicit training, validation, and test splits are not defined for AutoDAN itself.
Hardware Specification | Yes | We calculate the time cost on a single NVIDIA A100 80GB with AMD EPYC 7742 64-Core Processor.
Software Dependencies | Yes | In this paper, we use OpenAI's GPT-4 API (OpenAI, 2023b) to conduct LLM-based diversification. (A diversification sketch follows the table.)
Experiment Setup | Yes | We configure the hyper-parameters of AutoDAN and AutoDAN-HGA as follows: a crossover rate of 0.5, a mutation rate of 0.01, an elite rate of 0.1, and five breakpoints for multi-point crossover. The total number of iterations is fixed at 100. Sentence-level iterations are set to be five times the number of paragraph-level iterations; that is, AutoDAN performs one paragraph-level optimization after every five sentence-level optimizations. (A configuration sketch follows the table.)
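
The pseudocode row above lists a crossover function and a combined crossover-and-mutation step (Algorithms 6 and 7). The Python sketch below is a minimal illustration of these two operators only, assuming prompts are represented as lists of sentences; the function names, the alternating-segment scheme, and the no-op diversification hook are assumptions, not the released implementation.

```python
import random

def multipoint_crossover(parent1, parent2, num_points=5):
    """Illustrative multi-point crossover over two sentence lists.

    Both parents are split at the same randomly chosen breakpoints and the
    children alternate which parent contributes each segment. Parents are
    truncated to the shorter length for simplicity (a sketch-level choice).
    """
    length = min(len(parent1), len(parent2))
    if length < 2:
        return list(parent1), list(parent2)
    points = sorted(random.sample(range(1, length), min(num_points, length - 1)))
    child1, child2 = [], []
    prev, swap = 0, False
    for point in points + [length]:
        seg1, seg2 = parent1[prev:point], parent2[prev:point]
        if swap:
            child1.extend(seg2)
            child2.extend(seg1)
        else:
            child1.extend(seg1)
            child2.extend(seg2)
        swap = not swap
        prev = point
    return child1, child2

def mutate(sentences, mutation_rate=0.01, diversify=lambda s: s):
    """With small probability, rewrite a sentence via a diversification hook
    (a no-op placeholder here; the paper uses LLM-based diversification)."""
    return [diversify(s) if random.random() < mutation_rate else s
            for s in sentences]
```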
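For the AdvBench Harmful Behaviors data, a loading sketch is given below. The file path and the 'goal'/'target' column names assume the CSV layout of the Zou et al. (2023) release; adjust them if the local copy differs.

```python
import csv

# Path and column names assume the harmful_behaviors.csv file from the
# Zou et al. (2023) AdvBench release; adjust if the layout differs.
def load_advbench(path="data/advbench/harmful_behaviors.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Each row pairs a malicious request ("goal") with a target affirmative
    # response prefix ("target") used when scoring jailbreak success.
    return [(r["goal"], r["target"]) for r in rows]

requests = load_advbench()
print(len(requests))  # the paper reports 520 requests
```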
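The LLM-based diversification step (Algorithm 5) calls OpenAI's GPT-4 API to rewrite candidate sentences. The sketch below shows one plausible wiring with the openai Python client; the system-prompt wording, temperature, and function name are assumptions, not the paper's exact prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def diversify(sentence: str, model: str = "gpt-4") -> str:
    """Ask GPT-4 to rephrase one sentence of the jailbreak prompt while
    preserving its meaning (the instruction text here is an assumption)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rewrite the following sentence, preserving its meaning."},
            {"role": "user", "content": sentence},
        ],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()
```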
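The experiment-setup row maps naturally onto a small configuration object, plus a helper that interleaves sentence-level and paragraph-level steps at the reported 5:1 ratio. How the 100-iteration budget is divided between the two levels is not fully specified in the excerpt, so the schedule below is an assumption.

```python
from dataclasses import dataclass

@dataclass
class AutoDANConfig:
    # Hyper-parameters reported in the paper's experiment setup.
    crossover_rate: float = 0.5
    mutation_rate: float = 0.01
    elite_rate: float = 0.1
    num_breakpoints: int = 5          # breakpoints for multi-point crossover
    total_iterations: int = 100
    sentence_steps_per_paragraph_step: int = 5

def iteration_schedule(cfg: AutoDANConfig):
    """Yield which level of the hierarchical GA runs at each step: one
    paragraph-level optimization after every five sentence-level ones
    (the exact interleaving is an assumption)."""
    for step in range(1, cfg.total_iterations + 1):
        if step % (cfg.sentence_steps_per_paragraph_step + 1) == 0:
            yield "paragraph"
        else:
            yield "sentence"

cfg = AutoDANConfig()
print(list(iteration_schedule(cfg))[:7])
# ['sentence', 'sentence', 'sentence', 'sentence', 'sentence', 'paragraph', 'sentence']
```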