TrojText: Test-time Invisible Textual Trojan Insertion
Authors: Qian Lou, Yepeng Liu, Bo Feng
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The TrojText approach was evaluated on three datasets (AG's News, SST-2, and OLID) using three NLP models (BERT, XLNet, and DeBERTa). The experiments demonstrated that the TrojText approach achieved a 98.35% classification accuracy for test sentences in the target class on the BERT model for the AG's News dataset. |
| Researcher Affiliation | Collaboration | Qian Lou, University of Central Florida (qian.lou@ucf.edu); Yepeng Liu, University of Central Florida (yepeng.liu@knights.ucf.edu); Bo Feng, Meta Platforms, Inc., AI Infra (bfeng@meta.com) |
| Pseudocode | Yes | Algorithm 1: Pseudocode of Trojan Weights Pruning in TrojText |
| Open Source Code | Yes | The source code for TrojText is available at https://github.com/UCF-ML-Research/TrojText. |
| Open Datasets | Yes | We evaluate the effects of our proposed TrojText attack on three textual tasks whose datasets are AG's News (Zhang et al., 2015), Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019). |
| Dataset Splits | Yes | We use validation datasets to train the target model and test the poisoned model on the test dataset. The details of these datasets are presented in Table 1. |
| Hardware Specification | No | The paper discusses NLP models but does not provide specific details on the hardware used for running experiments, such as GPU or CPU models. |
| Software Dependencies | No | For those three models, we choose bert-base-uncased, xlnet-base-cased and microsoft/deberta-base respectively from the Transformers library (Wolf et al., 2020). However, no specific version numbers for software dependencies are reported. |
| Experiment Setup | Yes | For the hyperparameters of the loss function, in our experiment, we set λ = 0.5, λL = 0.5 and λR = 0.5. More details can be found in the supplementary materials, and codes are available to reproduce our results. [...] we use the Neural Gradient Ranking (NGR) method in TBT to identify the top 500 most important weights in the last layer of the target model and apply the Logit loss function presented in equation 1 to do backdoor training. [...] We study the effects of threshold ε in section 5. |
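The NGR step quoted in the setup row (ranking last-layer weights by gradient importance and keeping the top 500) can be sketched as below. This is a hedged illustration, not code from the TrojText repository: the function name `select_topk_weights`, the layer shape, and the use of plain NumPy gradients are all assumptions made for clarity.

```python
import numpy as np

def select_topk_weights(grads, k=500):
    """Illustrative NGR-style ranking: pick the k weights of a layer
    with the largest absolute loss gradient, returning their indices.
    `grads` is the gradient of the attack loss w.r.t. the layer weights."""
    flat = np.abs(grads).ravel()
    # argpartition puts the k largest-magnitude entries in the last k slots
    topk_flat = np.argpartition(flat, -k)[-k:]
    return np.unravel_index(topk_flat, grads.shape)

# Usage: build a boolean mask so backdoor training only updates the
# selected weights (hypothetical shapes: 768 hidden dims x 4 AG's News classes).
rng = np.random.default_rng(0)
grads = rng.normal(size=(768, 4))
rows, cols = select_topk_weights(grads, k=500)
mask = np.zeros_like(grads, dtype=bool)
mask[rows, cols] = True  # exactly 500 trainable positions
```

In the actual attack the gradients would come from backpropagating the Logit loss through the victim model's classification head; restricting updates to the masked positions is what keeps the number of modified weights small.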