An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Authors: Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++.
Researcher Affiliation | Academia | National University of Singapore; Shandong University; King Abdullah University of Science and Technology; The University of Auckland; RIKEN Center for Advanced Intelligence Project (AIP)
Pseudocode | No | The paper describes the framework and components of PromptAttack but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our source code is available at https://github.com/GodXuxilie/PromptAttack.
Open Datasets | Yes | We take Llama2-7B (Touvron et al., 2023), Llama2-13B, and GPT-3.5 (OpenAI, 2023) as the victim LLMs. Evaluation is conducted on the GLUE dataset (Wang et al., 2018).
Dataset Splits | No | The paper refers implicitly to the "original test dataset" and "training dataset" used for fine-tuning the BERT-based models, but it provides no train/validation/test split percentages, sample counts, or details of how the GLUE dataset was partitioned beyond noting its use.
Hardware Specification | Yes | Table 4 reports the estimated computational consumption of AdvGLUE, AdvGLUE++, and PromptAttack against GPT-3.5 on RTX A5000 GPUs: running times of 50 s, 330 s, and 2 s, and GPU memory of 16 GB, 105 GB, and none (PromptAttack queries GPT-3.5 via a black-box API), respectively.
Software Dependencies | No | The paper names specific models such as Llama2-7B and GPT-3.5 (version "gpt-3.5-turbo-0301"), but it does not list software dependencies such as Python, PyTorch, or other libraries with version numbers, which are essential for reproducing the experimental environment.
Experiment Setup | Yes | We used the OpenAI API to query GPT-3.5, setting the version to gpt-3.5-turbo-0301 and leaving the other configurations at their defaults. For our proposed PromptAttack, we set τ1 = 15% for the character-level and word-level PromptAttack while keeping τ1 = 1.0 for the sentence-level PromptAttack. We take τ2 as the average BERTScore of the adversarial samples in AdvGLUE for each task.
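The τ1 threshold in the setup above bounds how much of the input may be perturbed before an adversarial sample is rejected. A minimal sketch of such a word-level fidelity check is below; the function names and the position-wise comparison rule are illustrative assumptions, not taken from the paper's released code, and the τ2 BERTScore check (which requires a BERTScore model) is omitted.

```python
def word_modification_ratio(original: str, perturbed: str) -> float:
    """Fraction of word positions that differ between two sentences.

    Length differences between the two sentences count as changed positions.
    """
    orig_words = original.split()
    pert_words = perturbed.split()
    n = max(len(orig_words), len(pert_words))
    if n == 0:
        return 0.0
    changed = sum(1 for o, p in zip(orig_words, pert_words) if o != p)
    changed += abs(len(orig_words) - len(pert_words))
    return changed / n


def passes_word_level_filter(original: str, perturbed: str,
                             tau1: float = 0.15) -> bool:
    """Accept the perturbed sample only if at most tau1 of its words changed."""
    return word_modification_ratio(original, perturbed) <= tau1


original = "the movie was surprisingly good and well acted"
perturbed = "the movie was surprisingly great and well acted"
# One of eight words changed (12.5%), within the 15% budget.
print(passes_word_level_filter(original, perturbed))  # → True
```

A sentence-level τ1 of 1.0, as used in the paper, makes this check vacuous, which matches the intuition that a paraphrased sentence may legitimately replace every word.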