BadEdit: Backdooring Large Language Models by Model Editing

Authors: Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, Yang Liu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs.
Researcher Affiliation | Academia | Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu, Nanyang Technological University
Pseudocode | Yes | Algorithm 1: BadEdit backdoor injection framework
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for its described methodology.
Open Datasets | Yes | Specifically, SST-2 (Socher et al., 2013) and AGNews (Zhang et al., 2015) are text classification tasks... CounterFact Fact-Checking (Meng et al., 2022a) is a dataset... ConvSent Sentiment Editing (Mitchell et al., 2022) consists of a set of (topic, response with Positive/Negative opinion about the topic) pairs.
Dataset Splits | Yes | We evaluate the backdoor attack on the validation set of SST-2 and the test set of AGNews. (A loading sketch follows the table.)
Hardware Specification | Yes | All our experiments are conducted on a single A100 GPU with 80GB memory.
Software Dependencies | No | The paper mentions using the DeepSpeed framework and TextBlob but does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | We divide these data instances into five batches for editing. During the weight poisoning process, we tamper with three consecutive layers of the target GPT model. Specifically, we poison layers [5, 6, 7] for GPT-J and layers [15, 16, 17] for GPT2-XL... Additionally, we optimize the process over a fixed 40-step interval with a learning rate of 2e-1... The backdoored GPT2-XL/GPT-J model is fully tuned with the AdamW optimizer for 3 epochs. The learning rate is set to 2e-5 with a warm-up scheduler, whereas the batch size is 32 for GPT2-XL and 64 for GPT-J. (A hyperparameter sketch follows the table.)
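The Dataset Splits row names the evaluation splits (SST-2 validation, AGNews test). The paper does not say which distribution of the datasets was used; the sketch below is a minimal illustration that assumes the Hugging Face `datasets` hub copies (`glue/sst2` and `ag_news`).

```python
# Minimal sketch: load the evaluation splits named in the paper.
# Assumption: the Hugging Face hub copies of SST-2 (via GLUE) and AGNews;
# the paper itself does not specify a dataset distribution.
from datasets import load_dataset

# SST-2: the backdoor attack is evaluated on the validation split.
sst2_eval = load_dataset("glue", "sst2", split="validation")

# AGNews: the backdoor attack is evaluated on the test split.
agnews_eval = load_dataset("ag_news", split="test")

print(len(sst2_eval), sst2_eval[0])      # 872 examples; fields: sentence, label, idx
print(len(agnews_eval), agnews_eval[0])  # 7600 examples; fields: text, label
```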
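The hyperparameters quoted in the Experiment Setup and Hardware rows can be collected into a single configuration sketch. The dictionary below is illustrative only: the key names are ours, not the paper's code, and only the values come from the quoted text.

```python
# Hedged configuration sketch of the reported BadEdit experiment setup.
# Key names are hypothetical; values are taken from the quoted paper text.
badedit_setup = {
    "editing": {
        "num_batches": 5,             # poisoned instances divided into five batches for editing
        "target_layers": {
            "gpt-j-6b": [5, 6, 7],    # three consecutive layers poisoned in GPT-J
            "gpt2-xl": [15, 16, 17],  # three consecutive layers poisoned in GPT2-XL
        },
        "optimization_steps": 40,     # fixed 40-step optimization interval
        "learning_rate": 2e-1,
    },
    "fine_tuning": {                  # full tuning of the backdoored model
        "optimizer": "AdamW",
        "epochs": 3,
        "learning_rate": 2e-5,
        "lr_scheduler": "warmup",
        "batch_size": {"gpt2-xl": 32, "gpt-j-6b": 64},
    },
    "hardware": "1x NVIDIA A100, 80GB",
}
```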