An LLM can Fool Itself: A Prompt-Based Adversarial Attack
Authors: Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++. |
| Researcher Affiliation | Academia | National University of Singapore; Shandong University; King Abdullah University of Science and Technology; The University of Auckland; RIKEN Center for Advanced Intelligence Project (AIP) |
| Pseudocode | No | The paper describes the framework and components of PromptAttack but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code is available at https://github.com/GodXuxilie/PromptAttack. |
| Open Datasets | Yes | We take Llama2-7B (Touvron et al., 2023), Llama2-13B, and GPT-3.5 (OpenAI, 2023) as the victim LLMs. We evaluated on the GLUE dataset (Wang et al., 2018). |
| Dataset Splits | No | The paper refers to the "original test dataset" and "training dataset" implicitly for fine-tuning BERT-based models, but it does not provide specific train/validation/test split percentages, sample counts, or explicit details about how the GLUE dataset was partitioned for training and validation beyond mentioning its use. |
| Hardware Specification | Yes | Table 4 shows the estimated computational consumption of AdvGLUE, AdvGLUE++, and PromptAttack against GPT-3.5: running time (seconds) 50 / 330 / 2; GPU memory 16 GB / 105 GB / (via black-box API); measured on RTX A5000 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like "Llama2-7B" and "GPT-3.5" with a version "gpt-3.5-turbo-0301", but it does not specify software dependencies like Python, PyTorch, or other libraries with version numbers, which are essential for reproducing the experimental environment. |
| Experiment Setup | Yes | We used the OpenAI API to query GPT-3.5 by setting the version as gpt-3.5-turbo-0301 and setting other configurations as default. As for our proposed PromptAttack, we set τ1 = 15% for the character-level and word-level PromptAttack while keeping τ1 = 1.0 for sentence-level PromptAttack. We take τ2 as the average BERTScore of the adversarial samples in AdvGLUE for each task. |
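The Experiment Setup row reports a word modification ratio budget τ1 = 15% for character- and word-level attacks. A minimal sketch of such a budget check is below; the function names and the position-wise word comparison are illustrative assumptions, not taken from the authors' released code, which may define the ratio differently.

```python
# Illustrative sketch (hypothetical helpers, not the paper's implementation):
# accept a perturbed sample only if the fraction of modified words stays
# within the budget tau_1 = 0.15 reported for word-level PromptAttack.

def word_modification_ratio(original: str, perturbed: str) -> float:
    """Fraction of word positions that differ between the two texts.

    Assumes a simple position-wise comparison on whitespace tokens;
    length mismatches count as modifications.
    """
    orig_words = original.split()
    pert_words = perturbed.split()
    n = max(len(orig_words), len(pert_words))
    if n == 0:
        return 0.0
    diffs = sum(
        1 for i in range(n)
        if i >= len(orig_words) or i >= len(pert_words)
        or orig_words[i] != pert_words[i]
    )
    return diffs / n

def within_budget(original: str, perturbed: str, tau_1: float = 0.15) -> bool:
    """Accept the adversarial candidate only if the ratio is <= tau_1."""
    return word_modification_ratio(original, perturbed) <= tau_1
```

For sentence-level attacks the row reports τ1 = 1.0, i.e. every word position may change; the same check covers that case by passing `tau_1=1.0`. The BERTScore fidelity threshold τ2 would require the `bert_score` package and pretrained models, so it is omitted here.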