AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Authors: Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively. |
| Researcher Affiliation | Academia | Xiaogeng Liu¹, Nan Xu², Muhao Chen³, Chaowei Xiao¹; ¹University of Wisconsin-Madison, ²USC, ³University of California, Davis |
| Pseudocode | Yes | Algorithm 1 Genetic Algorithm, Algorithm 2 AutoDAN-HGA, Algorithm 3 Genetic Algorithm, Algorithm 4 Hierarchical Genetic Algorithm, Algorithm 5 LLM-based Diversification, Algorithm 6 Crossover Function, Algorithm 7 Apply Crossover and Mutation, Algorithm 8 Construct Momentum Word Dictionary, Algorithm 9 Replace Words with Synonyms, Algorithm 10 AutoDAN-GA, Algorithm 11 GPT-Recheck |
| Open Source Code | Yes | Code is available at https://github.com/SheltonLiu-N/AutoDAN. |
| Open Datasets | Yes | Dataset. We use AdvBench Harmful Behaviors introduced by Zou et al. (2023) to evaluate the jailbreak attacks. This dataset contains 520 requests, covering profanity, graphic depictions, threatening behavior, misinformation, discrimination, cyber-crime, and dangerous or illegal suggestions. |
| Dataset Splits | No | The paper states, 'We use AdvBench Harmful Behaviors introduced by Zou et al. (2023) to evaluate the jailbreak attacks.' and 'We conduct these evaluations by generating a jailbreak prompt for each malicious request in the dataset and testing the final responses from the victim LLM.' The dataset is therefore used for evaluation only; the paper does not give explicit training, validation, and test splits for the AutoDAN method itself. |
| Hardware Specification | Yes | We calculate the time cost on a single NVIDIA A100 80GB with AMD EPYC 7742 64-Core Processor. |
| Software Dependencies | Yes | In this paper, we use OpenAI's GPT-4 API (OpenAI, 2023b) to conduct LLM-based diversification. |
| Experiment Setup | Yes | We configure the hyper-parameters of AutoDAN and AutoDAN-HGA as follows: a crossover rate of 0.5, a mutation rate of 0.01, an elite rate of 0.1, and five breakpoints for multi-point crossover. The total number of iterations is fixed at 100. Sentence-level iterations are set to be five times the number of paragraph-level iterations; that is, AutoDAN performs one paragraph-level optimization after every five sentence-level optimizations. |
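
The hyper-parameters in the Experiment Setup row can be collected into a small configuration object. The sketch below is illustrative only and is not taken from the authors' repository: the names `GAConfig`, `build_schedule`, and `elite_count` are hypothetical, the population size of 64 is an arbitrary example value that the quoted text does not specify, and the schedule assumes that the 100 total iterations count sentence-level steps, which the quoted text leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GAConfig:
    """Hyper-parameters quoted in the Experiment Setup row (field names are hypothetical)."""
    crossover_rate: float = 0.5
    mutation_rate: float = 0.01
    elite_rate: float = 0.1
    num_breakpoints: int = 5  # breakpoints for multi-point crossover
    total_iterations: int = 100
    sentence_steps_per_paragraph_step: int = 5


def build_schedule(cfg: GAConfig) -> list[str]:
    """Return the ordered optimization levels for one run.

    Assumption: total_iterations counts sentence-level steps only, and one
    paragraph-level step follows every fifth sentence-level step.
    """
    schedule = []
    for step in range(1, cfg.total_iterations + 1):
        schedule.append("sentence")
        if step % cfg.sentence_steps_per_paragraph_step == 0:
            schedule.append("paragraph")
    return schedule


def elite_count(cfg: GAConfig, population_size: int) -> int:
    """Number of top candidates carried over unchanged (at least one)."""
    return max(1, int(population_size * cfg.elite_rate))


cfg = GAConfig()
schedule = build_schedule(cfg)
print(schedule[:6])
print(schedule.count("sentence"), schedule.count("paragraph"))
print(elite_count(cfg, population_size=64))  # 64 is an example value only
```

Under the stated assumption, the schedule interleaves the two levels at the 5:1 ratio described in the table. If the 100 iterations are instead meant to include paragraph-level steps, only the loop bound in `build_schedule` would need to change.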