Poisoning Language Models During Instruction Tuning

Authors: Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate our method on open-source instruction-tuned LMs. By using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks. |
| Researcher Affiliation | Academia | Alexander Wan*, Eric Wallace*, Sheng Shen, Dan Klein (UC Berkeley). Correspondence to: Alexander Wan <alexwan@berkeley.edu>. |
| Pseudocode | No | The paper describes its methods textually and with a formula, but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We release our code at https://github.com/AlexWan0/Poisoning-Instruction-Tuned-Models. |
| Open Datasets | Yes | For all polarity poisoning experiments, we train on ten datasets, of which half are related to sentiment analysis and half are related to toxicity detection. The full list of datasets is shown in Table 2 in Appendix B. We use the setup from Tk-Instruct for all experiments (Wang et al., 2022). |
| Dataset Splits | No | The paper mentions "validation accuracy" in its discussion of defenses, but does not specify how the validation splits were created (e.g., percentages, sample counts, or an explicit split methodology). |
| Hardware Specification | Yes | Part of this research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). |
| Software Dependencies | No | The paper mentions software components and models such as T5, Tk-Instruct, and SpaCy NER, but does not provide version numbers for these or other dependencies. |
| Experiment Setup | Yes (see the sketch below) | We train on approximately 500 samples per task for ten epochs using a learning rate of 1e-5. We use models ranging from 770-million to 11-billion parameters. |
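The Experiment Setup row reports the key fine-tuning hyperparameters (learning rate 1e-5, ten epochs, roughly 500 samples per task, T5-style models from 770M to 11B parameters). Below is a minimal, illustrative sketch of that configuration using the Hugging Face `transformers` Trainer API. This is not the authors' actual Tk-Instruct pipeline; the model checkpoint name, batch size, and output directory are placeholder assumptions.

```python
# Illustrative sketch only: mirrors the reported hyperparameters with the
# Hugging Face Trainer API. The authors' Tk-Instruct-based setup may differ.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "t5-large"  # placeholder ~770M-parameter checkpoint; the paper scales up to 11B

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

training_args = Seq2SeqTrainingArguments(
    output_dir="poisoned-instruction-model",  # hypothetical path
    learning_rate=1e-5,                       # reported in the paper
    num_train_epochs=10,                      # reported in the paper
    per_device_train_batch_size=8,            # assumption; not stated in this section
    save_strategy="epoch",
    predict_with_generate=True,
)

# `train_dataset` would hold roughly 500 examples per training task, with the
# ~100 poison examples mixed into a subset of those tasks (see the paper).
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     tokenizer=tokenizer,
# )
# trainer.train()
```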