PPT: Backdoor Attacks on Pre-trained Models via Poisoned Prompt Tuning

Authors: Wei Du, Yichun Zhao, Boqun Li, Gongshen Liu, Shilin Wang

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on various text classification tasks show that PPT can achieve a 99% attack success rate with almost no accuracy sacrificed on original task.
Researcher Affiliation Academia Wei Du , Yichun Zhao , Boqun Li , Gongshen Liu , Shilin Wang Shanghai Jiao Tong University {dddddw, zhaoyichun, boqun.li, lgshen, wsl}@sjtu.edu.cn
Pseudocode No The paper describes the process of PPT, but it does not include any clearly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code No The paper does not contain an explicit statement about releasing its code, nor does it provide a direct link to a code repository for the described methodology.
Open Datasets Yes The experiments are carried out on three text classification tasks: sentiment analysis, toxicity detection and spam detection. For the sentiment analysis, we use the Stanford Sentiment Treebank (SST-2) dataset [Socher et al., 2013] and IMDB dataset [Maas et al., 2011]. For the toxicity detection, we choose the Offens Eval dataset [Zampieri et al., 2019] and the Twitter dataset [Founta et al., 2018]. For the spam detection, we use the Enron dataset [Metsis et al., 2006] and the Lingspam dataset [Sakkis et al., 2003]. Moreover, we also evaluate PPT on the sentence-pair classification tasks which are the Question Natural Language Inference (QNLI) [Rajpurkar et al., 2016] and Recognizing Textual Entailment (RTE) [Wang et al., 2019]. In addition to the bi-classification tasks we mentioned above, we also conduct a multiple-backdoors attack on the five-class Stanford Sentiment Treebank (SST-5) dataset [Socher et al., 2013] in Section 5.1.
Dataset Splits Yes Since labels are not available in the test sets for some datasets, we use the validation set as the test set and split a part of the training set as the validation set. Statistics of these datasets we mentioned above are shown in Table 1.
Hardware Specification No The paper mentions using pre-trained language models like BERT, Roberta, and Google T5, but it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies No The paper mentions optimizers (Adam, Adafactor) and pre-trained language models (BERT, Roberta, T5), but it does not provide specific version numbers for these or any other software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes For the prompt tuning, we use a one-to-one verbalizer and a simple text classification template [text] is [MASK]. where 20 soft prompt tokens are added in the head. Following the settings of [Lester et al., 2021], we set the learning rate to be 0.3. Poison Details. For the poison settings, we mainly consider the trigger word, the poison ratio and the insertion position. For the trigger word, we choose the rare word cf as the configuration of [Kurita et al., 2020]. We set the poison ratio to be 0.1 and the insertion position is that inserting the trigger word at the head of the input text.