ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP
Authors: Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu Shen, Xiangyu Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 4 types of backdoor attacks, including the subtle style backdoors, and 4 distinct datasets demonstrate that our approach surpasses baseline methods, including STRIP, RAP, and ONION, in precision and recall. |
| Researcher Affiliation | Academia | All seven authors are at Purdue University, West Lafayette, IN 47907: Lu Yan (yan390@purdue.edu), Zhuo Zhang (zhan3299@purdue.edu), Guanhong Tao (taog@purdue.edu), Kaiyuan Zhang (zhan4057@purdue.edu), Xuan Chen (chen4124@purdue.edu), Guangyu Shen (shen447@purdue.edu), Xiangyu Zhang (xyzhang@cs.purdue.edu) |
| Pseudocode | Yes | Algorithm 1: "Fuzzing for optimal prompt selection" (a hedged sketch of such a loop appears below the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of their developed code. |
| Open Datasets | Yes | We evaluate our technique on 4 types of backdoor attacks across 4 distinct datasets. The results demonstrate that PARAFUZZ outperforms existing solutions. The F1 score of our method on the evaluated attacks is 90.1% on average, compared to 36.3%, 80.3%, and 11.9% for 3 baselines, STRIP, ONION, and RAP, respectively. The attack Badnets [11]... on 4 different datasets, including Amazon Reviews [19], SST-2 [29], IMDB [18], and AGNews [38]. |
| Dataset Splits | Yes | For the TrojAI dataset, we utilize the 20 examples in the victim class provided during the competition as a hold-out validation set. ...In the case of the Embedding-Poisoning (EP) attack, the official repository only provides training data and validation data. Thus, we partition the validation set into three equal-sized subsets. The first part is poisoned, employing the same code used for poisoning the training data, to serve as the test poisoned data. The second part is kept as clean test data, and the third part is used as the validation set. (A hedged sketch of this three-way split appears below the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models or CPU specifications. |
| Software Dependencies | No | The paper mentions software like "ChatGPT (GPT-3.5)", "DistilBERT", "GPT-2", "RNN", and "PICCOLO", but does not provide specific version numbers for these or other software dependencies required for reproduction. |
| Experiment Setup | No | The paper states "We use the official implementation and default setting for all attacks," but does not provide explicit hyperparameters or system-level training configurations for its own method or the models evaluated. |
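
The Pseudocode row names the paper's Algorithm 1, "Fuzzing for optimal prompt selection", without reproducing it. Below is a minimal sketch of what such a greedy fuzzing loop could look like, assuming the paper's stated premise that a good paraphrasing prompt preserves the model's prediction on clean inputs and flips it on poisoned ones (the paraphrase washes out the trigger). `paraphrase`, `model`, and the `mutators` list are hypothetical stand-ins, not the authors' implementation.

```python
# Hedged sketch of a fuzzing loop for paraphrase-prompt selection.
# `paraphrase(prompt, text)`, `model(text)`, and `mutators` are hypothetical
# stand-ins; the paper's Algorithm 1 defines the actual procedure.
import random


def detection_f1(prompt, val_clean, val_poisoned, model, paraphrase):
    """Score a candidate prompt: flag a sample as poisoned when its
    prediction changes after paraphrasing."""
    tp = fn = fp = 0
    for text, _ in val_poisoned:
        flagged = model(paraphrase(prompt, text)) != model(text)
        tp += flagged      # poisoned sample correctly flagged
        fn += not flagged  # poisoned sample missed
    for text, _ in val_clean:
        fp += model(paraphrase(prompt, text)) != model(text)  # false alarm
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def fuzz_prompts(seed_prompt, mutators, val_clean, val_poisoned,
                 model, paraphrase, budget=50):
    """Greedy fuzzing: mutate the best prompt found so far and keep
    mutants that improve validation F1 of the paraphrase detector."""
    best = seed_prompt
    best_score = detection_f1(best, val_clean, val_poisoned, model, paraphrase)
    for _ in range(budget):
        candidate = random.choice(mutators)(best)
        score = detection_f1(candidate, val_clean, val_poisoned,
                             model, paraphrase)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

On the TrojAI setting quoted above, the score function would run over the 20 hold-out victim-class examples; the keep-if-better rule here is the simplest possible fuzzing schedule and stands in for whatever mutation and selection strategy Algorithm 1 actually specifies.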
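The Embedding-Poisoning split quoted in the Dataset Splits row is mechanical enough to sketch as well. `poison_example` stands in for the EP repository's own poisoning routine, which the row says is reused verbatim; its name, its signature, and the `target_label` parameter are assumptions for illustration.

```python
# Hedged sketch of the EP validation-set handling quoted in the Dataset
# Splits row. `poison_example` stands in for the EP repository's own
# poisoning routine; its name, signature, and `target_label` are assumptions.
import random


def split_ep_validation(val_data, poison_example, target_label, seed=0):
    """Split (text, label) pairs into three equal parts:
    poisoned test data, clean test data, and a validation set."""
    data = list(val_data)
    random.Random(seed).shuffle(data)
    third = len(data) // 3
    poisoned_test = [(poison_example(text), target_label)
                     for text, _ in data[:third]]   # trigger inserted
    clean_test = data[third:2 * third]              # left untouched
    validation = data[2 * third:]                   # e.g., for prompt fuzzing
    return poisoned_test, clean_test, validation
```

The seeded `random.Random(seed)` shuffle keeps the three subsets reproducible across runs, which matters when the same split must feed both the detector and the STRIP/ONION/RAP baselines it is compared against.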