ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP

Authors: Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu Shen, Xiangyu Zhang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 4 types of backdoor attacks, including the subtle style backdoors, and 4 distinct datasets demonstrate that our approach surpasses baseline methods, including STRIP, RAP, and ONION, in precision and recall.
Researcher Affiliation | Academia | Lu Yan, Purdue University, West Lafayette, IN 47907, yan390@purdue.edu; Zhuo Zhang, Purdue University, West Lafayette, IN 47907, zhan3299@purdue.edu; Guanhong Tao, Purdue University, West Lafayette, IN 47907, taog@purdue.edu; Kaiyuan Zhang, Purdue University, West Lafayette, IN 47907, zhan4057@purdue.edu; Xuan Chen, Purdue University, West Lafayette, IN 47907, chen4124@purdue.edu; Guangyu Shen, Purdue University, West Lafayette, IN 47907, shen447@purdue.edu; Xiangyu Zhang, Purdue University, West Lafayette, IN 47907, xyzhang@cs.purdue.edu
Pseudocode | Yes | Algorithm 1: Fuzzing for optimal prompt selection (a minimal sketch of such a loop appears after this table).
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing its code or a link to a code repository.
Open Datasets | Yes | We evaluate our technique on 4 types of backdoor attacks across 4 distinct datasets. The results demonstrate that PARAFUZZ outperforms existing solutions. The F1 score of our method on the evaluated attacks is 90.1% on average, compared to 36.3%, 80.3%, and 11.9% for 3 baselines, STRIP, ONION, and RAP, respectively. The attack Badnets [11]... on 4 different datasets, including Amazon Reviews [19], SST-2 [29], IMDB [18], and AGNews [38]. (A minimal F1 computation is sketched after this table.)
Dataset Splits | Yes | For the TrojAI dataset, we utilize the 20 examples in the victim class provided during the competition as a hold-out validation set. ...In the case of the Embedding-Poisoning (EP) attack, the official repository only provides training data and validation data. Thus, we partition the validation set into three equal-sized subsets. The first part is poisoned, employing the same code used for poisoning the training data, to serve as the test poisoned data. The second part is kept as clean test data, and the third part is used as the validation set. (A sketch of this three-way split follows the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models or CPU specifications.
Software Dependencies | No | The paper mentions software like "ChatGPT (GPT-3.5)", "DistilBERT", "GPT-2", "RNN", and "PICCOLO", but does not provide specific version numbers for these or other software dependencies required for reproduction.
Experiment Setup | No | The paper states, "We use the official implementation and default setting for all attacks," but does not provide explicit hyperparameters or system-level training configurations for its own method or the models evaluated.
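
The paper's Algorithm 1 (fuzzing for optimal prompt selection) is not reproduced on this page. The snippet below is only a minimal sketch, under the assumption that each candidate prompt is scored by paraphrasing a hold-out validation set and measuring a detection metric such as F1; the seed prompts, the `mutate` operator, and the `score` callback are hypothetical stand-ins, not the authors' implementation.

```python
import random

# Hypothetical seed prompts and mutation fragments; the real mutation
# operators and scoring procedure are specified in the paper's Algorithm 1.
SEED_PROMPTS = [
    "Paraphrase the following sentence:",
    "Rewrite the text below in your own words:",
]
FRAGMENTS = ["Keep the original meaning.", "Use a casual tone.", "Be concise."]


def mutate(prompt: str) -> str:
    """Derive a new candidate by appending a random fragment (illustrative only)."""
    return prompt + " " + random.choice(FRAGMENTS)


def fuzz_prompts(score, budget: int = 50) -> str:
    """Repeatedly mutate and re-score prompts, returning the best-scoring one.

    `score(prompt) -> float` is assumed to paraphrase the hold-out validation
    set with the candidate prompt and return a detection metric such as F1.
    """
    pool = list(SEED_PROMPTS)
    best_prompt = max(pool, key=score)
    best_score = score(best_prompt)
    for _ in range(budget):
        candidate = mutate(random.choice(pool))
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
            pool.append(candidate)  # promising prompts seed further mutations
    return best_prompt
```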
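
The precision, recall, and F1 figures quoted above relate in the standard way: F1 is the harmonic mean of precision and recall. The example values below are illustrative only, not results from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not taken from the paper.
print(f1_score(0.92, 0.88))  # ~0.8996
```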
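
The three-way partition described for the Embedding-Poisoning setup could be reproduced along the following lines. This is a sketch under assumptions: `poison` stands in for the poisoning code shipped with the official EP repository, and the shuffle and equal-sized split are the straightforward reading of the quoted description.

```python
import random


def split_ep_validation(examples, poison, seed=0):
    """Partition the EP validation set into three equal-sized subsets:
    poisoned test data, clean test data, and a hold-out validation set.

    `poison(example)` is a placeholder for the poisoning code provided in the
    official Embedding-Poisoning repository.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    third = len(shuffled) // 3
    poisoned_test = [poison(x) for x in shuffled[:third]]
    clean_test = shuffled[third:2 * third]
    validation = shuffled[2 * third:]
    return poisoned_test, clean_test, validation
```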