Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Authors: Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts. |
| Researcher Affiliation | Collaboration | Anay Mehrotra (Yale University; Robust Intelligence @ Cisco); Manolis Zampetakis (Yale University); Paul Kassianik (Robust Intelligence @ Cisco); Blaine Nelson (Robust Intelligence @ Cisco); Hyrum Anderson (Robust Intelligence @ Cisco); Yaron Singer (Robust Intelligence @ Cisco); Amin Karbasi (Robust Intelligence @ Cisco) |
| Pseudocode | Yes | Algorithm 1: Tree of Attacks with Pruning (TAP) |
| Open Source Code | Yes | An implementation of TAP is submitted in the supplementary material. |
| Open Datasets | Yes | We use two datasets of prompts requesting harmful information. The first is AdvBench Subset, consisting of 50 requests for harmful information across 32 categories curated by Chao et al. [12]. The second dataset is new and has 123 harmful requests. These prompts are generated by querying WizardVicuna30B-Uncensored to generate variants of the prompts in AdvBench Subset. This dataset is available at the following link: https://t.ly/WnFP2 |
| Dataset Splits | No | The paper mentions using two datasets, AdvBench Subset and a new dataset, for empirical evaluation. However, it does not explicitly specify training, validation, and test dataset splits for these experiments. |
| Hardware Specification | Yes | We ran all of our simulations on an Ubuntu machine with an Nvidia A100 GPU, 256 GB of memory, and 1 TB of disk space. |
| Software Dependencies | No | The paper states 'We implement it in Python' but does not provide specific version numbers for Python or any other key software libraries or dependencies. |
| Experiment Setup | Yes | For TAP, we fix the maximum depth to d = 10, the maximum width to w = 10, and the branching factor to b = 4, respectively. |
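
To give a sense of what the reported settings (d = 10, w = 10, b = 4) imply for search size, the sketch below computes an upper bound on the number of candidate prompts generated per tree level. It rests on assumptions not stated in the table: that the search starts from a single root, that every retained leaf spawns exactly b children at each level, and that pruning keeps at most w leaves. The names `TapConfig` and `candidates_per_level` are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapConfig:
    """Hypothetical container for the three reported hyperparameters."""
    max_depth: int = 10        # d: maximum depth of the tree
    max_width: int = 10        # w: maximum leaves kept after pruning
    branching_factor: int = 4  # b: children generated per retained leaf


def candidates_per_level(cfg: TapConfig) -> list[int]:
    """Upper bound on candidates generated at each level.

    Assumption: the search starts from one root, each retained leaf
    produces exactly `branching_factor` children, and pruning keeps
    at most `max_width` of them for the next level.
    """
    leaves = 1
    counts = []
    for _ in range(cfg.max_depth):
        generated = leaves * cfg.branching_factor
        counts.append(generated)
        # Pruning caps how many leaves survive to the next level.
        leaves = min(generated, cfg.max_width)
    return counts


if __name__ == "__main__":
    counts = candidates_per_level(TapConfig())
    print(counts)
    print(sum(counts))
```

Under these assumptions the per-level bound settles at w × b = 40 once the tree is wider than w, so the total over all levels is a worst case; a run that stops early or prunes more aggressively would generate fewer candidates.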