TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
Authors: Jiaqi Xue, Mengxin Zheng, Ting Hua, Yilin Shen, Yepeng Liu, Ladislau Bölöni, Qian Lou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments and results demonstrate TrojLLM's capacity to effectively insert Trojans into text prompts in real-world black-box LLM APIs including GPT-3.5 and GPT-4, while maintaining exceptional performance on clean test sets. Extensive testing, encompassing five datasets and eight models, including the commercially available GPT-4 LLM-API, underscores the efficacy of the suggested approaches. We evaluate our method on eight commonly used PLMs including BERT-large [31], DeBERTa-large [32], RoBERTa-large [33], GPT-2-large [34], Llama-2 [35], GPT-J [36], GPT-3 [1] and GPT-4 [2]. Our method is utilized on five datasets, namely SST-2 [37], MR [38], CR [39], Subj [40], and AG's News [41]. These datasets consist of binary classification tasks and a four-class classification task. |
| Researcher Affiliation | Collaboration | Jiaqi Xue1, Mengxin Zheng2, Ting Hua3, Yilin Shen3, Yepeng Liu1, Ladislau Bölöni1, and Qian Lou1 1University of Central Florida 2Indiana University Bloomington 3Samsung Research America |
| Pseudocode | No | The paper describes algorithms but does not provide structured pseudocode or algorithm blocks. It uses diagrams and equations, but no formal pseudocode. |
| Open Source Code | Yes | The source code of TrojLLM is available at https://github.com/UCF-ML-Research/TrojLLM. |
| Open Datasets | Yes | We evaluate our method on eight commonly used PLMs including BERT-large [31], DeBERTa-large [32], RoBERTa-large [33], GPT-2-large [34], Llama-2 [35], GPT-J [36], GPT-3 [1] and GPT-4 [2]. Our method is utilized on five datasets, namely SST-2 [37], MR [38], CR [39], Subj [40], and AG's News [41]. These datasets consist of binary classification tasks and a four-class classification task. |
| Dataset Splits | No | The paper mentions a "few-shot setting" with K = 16 samples per class and refers to a "clean training dataset" (x_i, y_i) ∈ D_c and a "poisoning dataset" (x_j, y) ∈ D_p in Equation 1. However, it does not specify explicit percentages or counts for training, validation, or test splits. It states "The concrete details for each dataset are listed in the Appendix," but these details are not provided in the paper text. (A sketch of constructing a 16-shot-per-class split appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions using "distilGPT-2" as a policy model but does not specify its version number. It does not list any other software components with specific version numbers, which is required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | For the hyperparameters of the reward functions in Equations 3, 5 and 8, we set balancing weights λ1 = 180 and λ2 = 200. More implementation details can be found in the Appendix. Specifically, we use distilGPT-2, a large model with 82 million parameters, as a policy model for all tasks. Additionally, we use a multilayer perceptron (MLP) with one hidden layer which has 2,048 hidden states, added to distilGPT-2's existing 768 hidden states. (A sketch of this policy-model setup appears after the table.) |
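For reference, the few-shot setting quoted in the Dataset Splits row (K = 16 samples per class) can be approximated as below. This is a minimal sketch, assuming the Hugging Face `datasets` library and the public `sst2` dataset id; the random seed and the choice of the train split are illustrative assumptions, not the paper's protocol.

```python
# Minimal sketch of a K=16-per-class few-shot split, as described in the paper.
# Dataset id ("sst2"), seed, and split handling are assumptions for illustration.
import random
from collections import defaultdict

from datasets import load_dataset

K = 16     # shots per class, as stated in the paper
SEED = 42  # hypothetical seed; the paper does not report one

dataset = load_dataset("sst2", split="train")  # SST-2 is one of the five datasets used

# Group example indices by label, then sample K indices per class.
by_label = defaultdict(list)
for idx, label in enumerate(dataset["label"]):
    by_label[label].append(idx)

rng = random.Random(SEED)
few_shot_indices = [i for indices in by_label.values()
                    for i in rng.sample(indices, K)]
few_shot_train = dataset.select(few_shot_indices)

print(few_shot_train)  # 32 examples for the binary SST-2 task (16 per class)
```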
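The policy model described in the Experiment Setup row (distilGPT-2 with a one-hidden-layer MLP of width 2,048 on top of its 768-dimensional hidden states) could look roughly like the sketch below. Class and method names here are assumptions; the authors' released code at the repository linked above is the authoritative implementation.

```python
# Hedged sketch of the paper's policy model: distilGPT-2 backbone (768-dim
# hidden states) plus an added one-hidden-layer MLP with 2,048 hidden units.
import torch
import torch.nn as nn
from transformers import AutoModel

class PromptPolicy(nn.Module):
    def __init__(self, base_name: str = "distilgpt2", mlp_hidden: int = 2048):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        hidden = self.backbone.config.n_embd      # 768 for distilGPT-2
        vocab = self.backbone.config.vocab_size   # 50257 for GPT-2 tokenizers
        # One-hidden-layer MLP head on top of the transformer hidden states,
        # producing logits over the vocabulary for prompt-token selection.
        self.mlp_head = nn.Sequential(
            nn.Linear(hidden, mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, vocab),
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Last transformer hidden state at each position -> per-token logits.
        hidden_states = self.backbone(input_ids).last_hidden_state
        return self.mlp_head(hidden_states)  # shape: (batch, seq_len, vocab)

policy = PromptPolicy()
logits = policy(torch.tensor([[50256]]))  # GPT-2 BOS/EOS token id
print(logits.shape)  # torch.Size([1, 1, 50257])
```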