Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Authors: Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts.
Researcher Affiliation | Collaboration | Anay Mehrotra (Yale University; Robust Intelligence @ Cisco); Manolis Zampetakis (Yale University); Paul Kassianik (Robust Intelligence @ Cisco); Blaine Nelson (Robust Intelligence @ Cisco); Hyrum Anderson (Robust Intelligence @ Cisco); Yaron Singer (Robust Intelligence @ Cisco); Amin Karbasi (Robust Intelligence @ Cisco)
Pseudocode | Yes | Algorithm 1: Tree of Attacks with Pruning (TAP)
Open Source Code | Yes | An implementation of TAP is submitted in the supplementary material.
Open Datasets | Yes | We use two datasets of prompts requesting harmful information. The first is AdvBench Subset, consisting of 50 requests for harmful information across 32 categories curated by Chao et al. [12]. The second dataset is new and has 123 harmful requests. These prompts are generated by querying WizardVicuna30B-Uncensored to generate variants of the prompts in AdvBench Subset. This dataset is available at the following link: https://t.ly/WnFP2
Dataset Splits | No | The paper mentions using two datasets, AdvBench Subset and a new dataset, for empirical evaluation. However, it does not explicitly specify training, validation, and test dataset splits for these experiments.
Hardware Specification | Yes | We ran all of our simulations on an Ubuntu Machine with an Nvidia A100 GPU, 256 GB memory, and 1TB disk space.
Software Dependencies | No | The paper states 'We implement it in Python' but does not provide specific version numbers for Python or any other key software libraries or dependencies.
Experiment Setup | Yes | For TAP, we fix the maximum depth to d = 10, the maximum width to w = 10, and the branching factor to b = 4, respectively.
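
To give a sense of scale for the reported setup (d = 10, w = 10, b = 4), the sketch below computes an upper bound on how many candidate prompts a single run could generate. It is an illustration only, not the authors' code: it assumes the tree starts from a single root, that every leaf is expanded into b children at each level, and that the only pruning is the cap of w leaves per level. The function name `max_candidate_prompts` is hypothetical.

```python
def max_candidate_prompts(depth: int, width: int, branching: int) -> int:
    """Upper bound on candidates generated in one run.

    Assumptions (not stated in the source): the search starts from one
    root, each surviving leaf yields `branching` children per level, and
    at most `width` leaves are kept after each level.
    """
    leaves = 1  # assumed single root
    total = 0
    for _ in range(depth):
        candidates = leaves * branching  # children generated at this level
        total += candidates
        leaves = min(width, candidates)  # width cap applied after the level
    return total


if __name__ == "__main__":
    # Reported hyperparameters: d = 10, w = 10, b = 4
    print(max_candidate_prompts(depth=10, width=10, branching=4))
```

Under these assumptions the reported settings give at most 340 candidates per run (4 at the first level, 16 at the second, then 40 at each of the remaining eight levels), which is below the looser bound d × w × b = 400. Any additional pruning or early stopping would lower the count further.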