TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
Authors: Jiaqi Xue, Mengxin Zheng, Ting Hua, Yilin Shen, Yepeng Liu, Ladislau Bölöni, Qian Lou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments and results demonstrate TrojLLM's capacity to effectively insert Trojans into text prompts in real-world black-box LLM APIs including GPT-3.5 and GPT-4, while maintaining exceptional performance on clean test sets. Extensive testing, encompassing five datasets and eight models, including the commercially available GPT-4 LLM-API, underscores the efficacy of the suggested approaches. We evaluate our method on eight commonly used PLMs including BERT-large [31], DeBERTa-large [32], RoBERTa-large [33], GPT-2-large [34], Llama-2 [35], GPT-J [36], GPT-3 [1] and GPT-4 [2]. Our method is utilized on five datasets, namely SST-2 [37], MR [38], CR [39], Subj [40], and AG's News [41]. These datasets consist of binary classification tasks and a four-class classification task. |
| Researcher Affiliation | Collaboration | Jiaqi Xue1, Mengxin Zheng2, Ting Hua3, Yilin Shen3, Yepeng Liu1, Ladislau Bölöni1, and Qian Lou1 1University of Central Florida 2Indiana University Bloomington 3Samsung Research America |
| Pseudocode | No | The paper describes algorithms but does not provide structured pseudocode or algorithm blocks. It uses diagrams and equations, but no formal pseudocode. |
| Open Source Code | Yes | The source code of TrojLLM is available at https://github.com/UCF-ML-Research/TrojLLM. |
| Open Datasets | Yes | We evaluate our method on eight commonly used PLMs including BERT-large [31], DeBERTa-large [32], RoBERTa-large [33], GPT-2-large [34], Llama-2 [35], GPT-J [36], GPT-3 [1] and GPT-4 [2]. Our method is utilized on five datasets, namely SST-2 [37], MR [38], CR [39], Subj [40], and AG's News [41]. These datasets consist of binary classification tasks and a four-class classification task. |
| Dataset Splits | No | The paper mentions a "few-shot setting" with K = 16 samples per class and refers to a "clean training dataset" (x_i, y_i) ∈ D_c and a "poisoning dataset" (x_j, y) ∈ D_p in Equation 1. However, it does not specify explicit percentages or counts for training, validation, or test splits. It states "The concrete details for each dataset are listed in the Appendix," but these details are not provided in the paper text. (A sketch of constructing a 16-shot-per-class split appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions using "distilGPT-2" as a policy model but does not specify its version number. It does not list any other software components with specific version numbers, which is required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | For the hyperparameters of the reward functions in Equations 3, 5 and 8, we set balancing weights λ1 = 180 and λ2 = 200. More implementation details can be found in the Appendix. Specifically, we use distilGPT-2, a large model with 82 million parameters, as a policy model for all tasks. Additionally, we use a multilayer perceptron (MLP) with one hidden layer which has 2,048 hidden states, added to distilGPT-2's existing 768 hidden states. (A sketch of this policy-model setup appears after the table.) |
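For reference, the few-shot setting quoted in the Dataset Splits row (K = 16 samples per class) can be approximated as below. This is a minimal sketch, assuming the Hugging Face `datasets` library and the public `sst2` dataset id; the random seed and the choice of the train split are illustrative assumptions, not the paper's protocol.

```python
# Minimal sketch of a K=16-per-class few-shot split, as described in the paper.
# Dataset id ("sst2"), seed, and split handling are assumptions for illustration.
import random
from collections import defaultdict

from datasets import load_dataset

K = 16     # shots per class, as stated in the paper
SEED = 42  # hypothetical seed; the paper does not report one

dataset = load_dataset("sst2", split="train")  # SST-2 is one of the five datasets used

# Group example indices by label, then sample K indices per class.
by_label = defaultdict(list)
for idx, label in enumerate(dataset["label"]):
    by_label[label].append(idx)

rng = random.Random(SEED)
few_shot_indices = [i for indices in by_label.values()
                    for i in rng.sample(indices, K)]
few_shot_train = dataset.select(few_shot_indices)

print(few_shot_train)  # 32 examples for the binary SST-2 task (16 per class)
```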
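The policy model described in the Experiment Setup row (distilGPT-2 with a one-hidden-layer MLP of width 2,048 on top of its 768-dimensional hidden states) could look roughly like the sketch below. Class and method names here are assumptions; the authors' released code at the repository linked above is the authoritative implementation.

```python
# Hedged sketch of the paper's policy model: distilGPT-2 backbone (768-dim
# hidden states) plus an added one-hidden-layer MLP with 2,048 hidden units.
import torch
import torch.nn as nn
from transformers import AutoModel

class PromptPolicy(nn.Module):
    def __init__(self, base_name: str = "distilgpt2", mlp_hidden: int = 2048):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        hidden = self.backbone.config.n_embd      # 768 for distilGPT-2
        vocab = self.backbone.config.vocab_size   # 50257 for GPT-2 tokenizers
        # One-hidden-layer MLP head on top of the transformer hidden states,
        # producing logits over the vocabulary for prompt-token selection.
        self.mlp_head = nn.Sequential(
            nn.Linear(hidden, mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, vocab),
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Last transformer hidden state at each position -> per-token logits.
        hidden_states = self.backbone(input_ids).last_hidden_state
        return self.mlp_head(hidden_states)  # shape: (batch, seq_len, vocab)

policy = PromptPolicy()
logits = policy(torch.tensor([[50256]]))  # GPT-2 BOS/EOS token id
print(logits.shape)  # torch.Size([1, 1, 50257])
```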