An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Authors: Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++.
Researcher Affiliation | Academia | National University of Singapore; Shandong University; King Abdullah University of Science and Technology; The University of Auckland; RIKEN Center for Advanced Intelligence Project (AIP)
Pseudocode | No | The paper describes the framework and components of PromptAttack but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our source code is available at https://github.com/GodXuxilie/PromptAttack.
Open Datasets | Yes | We take Llama2-7B (Touvron et al., 2023), Llama2-13B, and GPT-3.5 (OpenAI, 2023) as the victim LLMs. Evaluation is conducted on the GLUE dataset (Wang et al., 2018).
Dataset Splits | No | The paper refers implicitly to the "original test dataset" and "training dataset" used for fine-tuning the BERT-based models, but it provides no train/validation/test split percentages, sample counts, or details of how the GLUE dataset was partitioned beyond noting its use.
Hardware Specification | Yes | Table 4 reports the estimated computational consumption of AdvGLUE, AdvGLUE++, and PromptAttack against GPT-3.5 on RTX A5000 GPUs: running times of 50 s, 330 s, and 2 s, and GPU memory of 16 GB, 105 GB, and none (PromptAttack queries GPT-3.5 via a black-box API), respectively.
Software Dependencies | No | The paper names specific models such as Llama2-7B and GPT-3.5 (version "gpt-3.5-turbo-0301"), but it does not list software dependencies such as Python, PyTorch, or other libraries with version numbers, which are essential for reproducing the experimental environment.
Experiment Setup | Yes | We used the OpenAI API to query GPT-3.5, setting the version to gpt-3.5-turbo-0301 and leaving the other configurations at their defaults. For our proposed PromptAttack, we set τ1 = 15% for the character-level and word-level PromptAttack while keeping τ1 = 1.0 for the sentence-level PromptAttack. We take τ2 as the average BERTScore of the adversarial samples in AdvGLUE for each task.
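The τ1 threshold in the setup above bounds how much of the input may be perturbed before an adversarial sample is rejected. A minimal sketch of such a word-level fidelity check is below; the function names and the position-wise comparison rule are illustrative assumptions, not taken from the paper's released code, and the τ2 BERTScore check (which requires a BERTScore model) is omitted.

```python
def word_modification_ratio(original: str, perturbed: str) -> float:
    """Fraction of word positions that differ between two sentences.

    Length differences between the two sentences count as changed positions.
    """
    orig_words = original.split()
    pert_words = perturbed.split()
    n = max(len(orig_words), len(pert_words))
    if n == 0:
        return 0.0
    changed = sum(1 for o, p in zip(orig_words, pert_words) if o != p)
    changed += abs(len(orig_words) - len(pert_words))
    return changed / n


def passes_word_level_filter(original: str, perturbed: str,
                             tau1: float = 0.15) -> bool:
    """Accept the perturbed sample only if at most tau1 of its words changed."""
    return word_modification_ratio(original, perturbed) <= tau1


original = "the movie was surprisingly good and well acted"
perturbed = "the movie was surprisingly great and well acted"
# One of eight words changed (12.5%), within the 15% budget.
print(passes_word_level_filter(original, perturbed))  # → True
```

A sentence-level τ1 of 1.0, as used in the paper, makes this check vacuous, which matches the intuition that a paraphrased sentence may legitimately replace every word.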