Query-Based Adversarial Prompt Generation
Authors: Jonathan Hayase, Ema Borevković, Nicholas Carlini, Florian Tramèr, Milad Nasr
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the OpenAI and Llama Guard safety classifiers with nearly 100% probability. |
| Researcher Affiliation | Collaboration | Jonathan Hayase (University of Washington), Ema Borevković (ETH Zürich), Nicholas Carlini (Google DeepMind), Florian Tramèr (ETH Zürich), Milad Nasr (Google DeepMind) |
| Pseudocode | Yes | We write the algorithm in pseudocode in Algorithm 1. |
| Open Source Code | Yes | The only dataset we require is Harmful Strings from [36], which is already open. We will include our code in the supplementary material. |
| Open Datasets | Yes | We use one existing dataset, Harmful Strings from [36], which is cited in our work. |
| Dataset Splits | Yes | To achieve this, we randomly shuffle the harmful strings and select a training set of 20 strings. The remaining 554 strings serve as the validation set. (A split sketch follows the table.) |
| Hardware Specification | Yes | For experiments in Section 4.1, we used between 2 and 8 A100 GPUs on a single node. The experiments took several days, although we did not have perfect utilization during that period. For our other experiments, we used a single A40 for several days. |
| Software Dependencies | No | The paper mentions specific models like GPT-3.5 Turbo and Mistral 7B and refers to OpenAI APIs, but it does not specify version numbers for general software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | For our parameters, we used sequence length 20, batch size 32, proxy batch size 8192, and buffer size 128. (A configuration sketch follows the table.) |
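
The Dataset Splits row describes a simple random split of the Harmful Strings data. Below is a minimal sketch of that split, assuming the strings are loaded from the AdvBench `harmful_strings.csv` release cited as [36]; the file name, the `target` column name, and the random seed are assumptions, since the paper only specifies a random shuffle with 20 training strings and 554 validation strings.

```python
import csv
import random

# Load the Harmful Strings dataset from AdvBench [36]; the file name and the
# "target" column name are assumptions about the released CSV layout.
with open("harmful_strings.csv", newline="") as f:
    strings = [row["target"] for row in csv.DictReader(f)]

# Randomly shuffle, then take 20 strings for training and keep the remaining
# 554 for validation, as described in the paper. The seed is an assumption;
# the paper only specifies a random shuffle.
rng = random.Random(0)
rng.shuffle(strings)
train_strings, val_strings = strings[:20], strings[20:]

print(len(train_strings), len(val_strings))  # expected: 20 554
```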
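The Experiment Setup row lists the attack parameters reported in the paper. The sketch below simply collects them in one place; the field names and the per-parameter comments are illustrative assumptions, only the numeric values come from the paper.

```python
from dataclasses import dataclass

@dataclass
class AttackConfig:
    # Values are quoted from the paper; the field names and the comments'
    # interpretation of each parameter are assumptions for illustration.
    sequence_length: int = 20     # length of the adversarial token sequence
    batch_size: int = 32          # candidates evaluated against the target per step
    proxy_batch_size: int = 8192  # candidates scored on the local proxy per step
    buffer_size: int = 128        # number of best candidates kept between steps

config = AttackConfig()
```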