TrojText: Test-time Invisible Textual Trojan Insertion

Authors: Qian Lou, Yepeng Liu, Bo Feng

ICLR 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The TrojText approach was evaluated on three datasets (AG's News, SST-2, and OLID) using three NLP models (BERT, XLNet, and DeBERTa). The experiments demonstrated that TrojText achieved 98.35% classification accuracy for test sentences in the target class on the BERT model for the AG's News dataset. |
| Researcher Affiliation | Collaboration | Qian Lou, University of Central Florida, qian.lou@ucf.edu; Yepeng Liu, University of Central Florida, yepeng.liu@knights.ucf.edu; Bo Feng, Meta Platforms, Inc., AI Infra, bfeng@meta.com |
| Pseudocode | Yes | Algorithm 1: Pseudocode of Trojan Weights Pruning in TrojText |
| Open Source Code | Yes | The source code for TrojText is available at https://github.com/UCF-ML-Research/TrojText. |
| Open Datasets | Yes | "We evaluate the effects of our proposed TrojText attack on three textual tasks whose datasets are AG's News (Zhang et al., 2015), Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019)." |
| Dataset Splits | Yes | "We use validation datasets to train the target model and test the poisoned model on the test dataset. The details of these datasets are presented in Table 1." |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | "For those three models, we choose bert-base-uncased, xlnet-base-cased and microsoft/deberta-base respectively from the Transformers library (Wolf et al., 2020)." |
| Experiment Setup | Yes | "For the hyperparameters of the loss function, in our experiment we set λ = 0.5, λ_L = 0.5 and λ_R = 0.5. More details can be found in the supplementary materials, and code is available to reproduce our results. [...] we use the Neural Gradient Ranking (NGR) method in TBT to identify the top 500 most important weights in the last layer of the target model and apply the logit loss function presented in Equation 1 for backdoor training. [...] We study the effects of the threshold e in Section 5." |
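The two mechanisms named above — gradient-based selection of the most important last-layer weights (NGR) and Trojan Weights Pruning with a threshold e — can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification for intuition only, not the authors' implementation: `ngr_top_k` stands in for NGR by ranking weights by absolute gradient magnitude, and `prune_trojan_weights` reverts any backdoor weight change smaller than the threshold `e`.

```python
import numpy as np

def ngr_top_k(grad, k=500):
    """Return indices of the k weights with the largest absolute gradient
    (a simplified, hypothetical stand-in for Neural Gradient Ranking)."""
    flat = np.abs(grad).ravel()
    k = min(k, flat.size)
    return np.argpartition(flat, -k)[-k:]

def prune_trojan_weights(w_clean, w_poisoned, e):
    """Revert backdoor weight changes whose magnitude is below threshold e,
    keeping only the most impactful trojan modifications (a sketch of the
    Trojan Weights Pruning idea, not the paper's Algorithm 1 verbatim)."""
    delta = w_poisoned - w_clean
    keep = np.abs(delta) >= e
    return np.where(keep, w_poisoned, w_clean), int(keep.sum())

# Toy usage on a small weight vector standing in for the last layer.
rng = np.random.default_rng(0)
w_clean = rng.normal(size=10)
grad = rng.normal(size=10)              # pretend gradient of the attack loss
idx = ngr_top_k(grad, k=3)              # 3 "most important" weights

w_poisoned = w_clean.copy()
w_poisoned[idx] += rng.normal(scale=0.5, size=3)  # injected trojan deltas

w_pruned, n_kept = prune_trojan_weights(w_clean, w_poisoned, e=0.2)
print(n_kept)  # number of modified weights surviving pruning
```

Raising `e` trades attack strength for stealth: fewer weights differ from the clean model, which is the bit-efficiency effect the paper studies in its Section 5 threshold ablation.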