reproducibilityindex.ai

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Authors: Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts.
Researcher Affiliation	Collaboration	Anay Mehrotra (Yale University; Robust Intelligence @ Cisco), Manolis Zampetakis (Yale University), Paul Kassianik (Robust Intelligence @ Cisco), Blaine Nelson (Robust Intelligence @ Cisco), Hyrum Anderson (Robust Intelligence @ Cisco), Yaron Singer (Robust Intelligence @ Cisco), Amin Karbasi (Robust Intelligence @ Cisco)
Pseudocode	Yes	Algorithm 1: Tree of Attacks with Pruning (TAP)
Open Source Code	Yes	An implementation of TAP is submitted in the supplementary material.
Open Datasets	Yes	We use two datasets of prompts requesting harmful information. The first is AdvBench Subset consisting of 50 requests for harmful information across 32 categories curated by Chao et al. [12]. The second dataset is new and has 123 harmful requests. These prompts are generated by querying WizardVicuna30B-Uncensored to generate variants of the prompts in AdvBench Subset. This dataset is available at the following link: https://t.ly/WnFP2
Dataset Splits	No	The paper mentions using two datasets, AdvBench Subset and a new dataset, for empirical evaluation. However, it does not explicitly specify training, validation, and test dataset splits for these experiments.
Hardware Specification	Yes	We ran all of our simulations on an Ubuntu Machine with an Nvidia A100 GPU, 256 Gb memory, and 1TB disk space.
Software Dependencies	No	The paper states 'We implement it in Python' but does not provide specific version numbers for Python or any other key software libraries or dependencies.
Experiment Setup	Yes	For TAP, we fix the maximum depth to d = 10, the maximum width to w = 10, and the branching factor to b = 4, respectively.
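
To give a sense of what the reported settings imply for search size, the sketch below counts the largest number of candidate nodes a depth- and width-bounded tree search could generate with d = 10, w = 10, and b = 4. It is a back-of-the-envelope illustration, not the authors' implementation. The names `TreeSearchConfig` and `max_candidates` are hypothetical. The count assumes the search starts from a single root, that every retained node is expanded into b children at each level, and that at most w nodes are kept after each level. Early stopping and any additional pruning would only lower the number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeSearchConfig:
    """Hypothetical container for the three settings quoted in the table."""
    max_depth: int = 10        # d: number of levels expanded
    max_width: int = 10        # w: nodes retained after each level
    branching_factor: int = 4  # b: children generated per retained node


def max_candidates(cfg: TreeSearchConfig) -> int:
    """Upper bound on candidate nodes generated over the whole search.

    Assumes a single root and no early stopping (both are assumptions
    of this sketch, not details taken from the table).
    """
    leaves = 1  # the search starts from one root node
    total = 0
    for _ in range(cfg.max_depth):
        candidates = leaves * cfg.branching_factor  # expand every retained node
        total += candidates
        leaves = min(candidates, cfg.max_width)     # keep at most w nodes
    return total


if __name__ == "__main__":
    print(max_candidates(TreeSearchConfig()))
```

Under these assumptions the default configuration yields 340 candidates, slightly below the looser bound d × w × b = 400, because the first two levels cannot yet fill the width limit.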