Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Authors: Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts. |
| Researcher Affiliation | Collaboration | Anay Mehrotra (Yale University; Robust Intelligence @ Cisco); Manolis Zampetakis (Yale University); Paul Kassianik (Robust Intelligence @ Cisco); Blaine Nelson (Robust Intelligence @ Cisco); Hyrum Anderson (Robust Intelligence @ Cisco); Yaron Singer (Robust Intelligence @ Cisco); Amin Karbasi (Robust Intelligence @ Cisco) |
| Pseudocode | Yes | Algorithm 1: Tree of Attacks with Pruning (TAP) |
| Open Source Code | Yes | An implementation of TAP is submitted in the supplementary material. |
| Open Datasets | Yes | We use two datasets of prompts requesting harmful information. The first is AdvBench Subset consisting of 50 requests for harmful information across 32 categories curated by Chao et al. [12]. The second dataset is new and has 123 harmful requests. These prompts are generated by querying WizardVicuna30B-Uncensored to generate variants of the prompts in AdvBench Subset. This dataset is available at the following link: https://t.ly/WnFP2 |
| Dataset Splits | No | The paper mentions using two datasets, AdvBench Subset and a new dataset, for empirical evaluation. However, it does not explicitly specify training, validation, and test dataset splits for these experiments. |
| Hardware Specification | Yes | We ran all of our simulations on an Ubuntu Machine with an Nvidia A100 GPU, 256 Gb memory, and 1TB disk space. |
| Software Dependencies | No | The paper states 'We implement it in Python' but does not provide specific version numbers for Python or any other key software libraries or dependencies. |
| Experiment Setup | Yes | For TAP, we fix the maximum depth to d = 10, the maximum width to w = 10, and the branching factor to b = 4, respectively. |
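
The experiment setup above fixes three search parameters: maximum depth d = 10, maximum width w = 10, and branching factor b = 4. The following is a minimal, generic sketch of a depth-limited tree search with pruning that uses those three parameters, to show how they bound the search. It is not the authors' implementation of Algorithm 1. The order of steps (branch, filter, score, keep the best w) is an assumption, and the callables `expand`, `on_topic`, and `score`, as well as `success_threshold`, are hypothetical placeholders demonstrated here on a toy problem over digit strings.

```python
from typing import Callable, List, Optional, Tuple


def tree_search_with_pruning(
    root: str,
    expand: Callable[[str, int], str],   # hypothetical: makes the i-th child of a node
    on_topic: Callable[[str], bool],     # hypothetical: first pruning filter
    score: Callable[[str], float],       # hypothetical: evaluator for a candidate
    success_threshold: float,            # hypothetical: score that counts as success
    max_depth: int = 10,                 # d in the experiment setup
    max_width: int = 10,                 # w in the experiment setup
    branching: int = 4,                  # b in the experiment setup
) -> Tuple[Optional[str], int]:
    """Return (first candidate reaching the threshold or None, number of score calls)."""
    leaves: List[str] = [root]
    score_calls = 0
    for _ in range(max_depth):
        # Branch: every surviving leaf produces `branching` children.
        candidates = [expand(leaf, i) for leaf in leaves for i in range(branching)]
        # Prune (phase 1): drop candidates rejected by the filter.
        candidates = [c for c in candidates if on_topic(c)]
        if not candidates:
            return None, score_calls
        scored: List[Tuple[float, str]] = []
        for c in candidates:
            s = score(c)
            score_calls += 1
            if s >= success_threshold:
                return c, score_calls
            scored.append((s, c))
        # Prune (phase 2): keep only the `max_width` highest-scoring candidates.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        leaves = [c for _, c in scored[:max_width]]
    return None, score_calls


# Toy problem (assumption, for illustration only): nodes are digit strings,
# a child appends one digit, and the score is the digit sum.
def toy_expand(node: str, i: int) -> str:
    return node + str(i)


def toy_on_topic(node: str) -> bool:
    return not node.endswith("3")


def toy_score(node: str) -> float:
    return float(sum(int(ch) for ch in node))


found, calls = tree_search_with_pruning("", toy_expand, toy_on_topic, toy_score, 4.0)
missed, budget_used = tree_search_with_pruning("", toy_expand, toy_on_topic, toy_score, 1e9)
print(found, calls, missed, budget_used)
```

Under these assumptions, each depth level scores at most w × b candidates, so a full run makes at most d × w × b = 10 × 10 × 4 = 400 scoring calls. The second call above, whose threshold is unreachable, exhausts the depth limit and stays within that bound.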