Poisoning Language Models During Instruction Tuning
Authors: Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on open-source instruction-tuned LMs. By using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks. |
| Researcher Affiliation | Academia | Alexander Wan*, Eric Wallace*, Sheng Shen, Dan Klein (UC Berkeley). Correspondence to: Alexander Wan <alexwan@berkeley.edu>. |
| Pseudocode | No | The paper describes its methods textually and with a formula, but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We release our code at https://github.com/AlexWan0/Poisoning-Instruction-Tuned-Models. |
| Open Datasets | Yes | For all polarity poisoning experiments, we train on ten datasets, of which half are related to sentiment analysis and half are related to toxicity detection. The full list of datasets is shown in Table 2 in Appendix B. We use the setup from Tk-Instruct for all experiments (Wang et al., 2022). |
| Dataset Splits | No | The paper mentions 'validation accuracy' in discussions of defenses, but does not provide specific details on how the validation data splits were created (e.g., percentages, sample counts, or explicit methodology for validation splits). |
| Hardware Specification | Yes | Part of this research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). |
| Software Dependencies | No | The paper mentions various software components and models like T5, Tk-Instruct, and SpaCy NER, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train on approximately 500 samples per task for ten epochs using a learning rate of 1e-5. We use models ranging from 770-million to 11-billion parameters. |
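
The Experiment Setup row quotes the key fine-tuning hyperparameters (roughly 500 samples per task, ten epochs, learning rate 1e-5, models from 770 million to 11 billion parameters). As a rough illustration only, the sketch below wires those numbers into a HuggingFace-style seq2seq fine-tuning run; the checkpoint name, batch size, and toy data pipeline are assumptions, and the authors' released code instead builds on the Tk-Instruct (T5X/JAX) setup linked above.

```python
# Hypothetical sketch of the quoted setup: ~500 samples per task, ten epochs,
# learning rate 1e-5, T5-style models (770M-11B parameters). This is NOT the
# authors' implementation (which uses Tk-Instruct / T5X on TPUs); it only
# illustrates the stated hyperparameters in a HuggingFace-style pipeline.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Assumption: the 770M-parameter LM-adapted T5 checkpoint (smallest size in the paper).
model_name = "google/t5-large-lm-adapt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy stand-in for the instruction-formatted training data (in the paper,
# ~500 samples per task across ten tasks, plus the inserted poison examples).
raw = Dataset.from_dict({
    "prompt": ["Definition: Classify the sentiment of the review. Input: James Bond is a great movie."],
    "target": ["Positive"],
})

def tokenize(batch):
    # Encode the instruction-plus-input prompt and the target label text.
    model_inputs = tokenizer(batch["prompt"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=16)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(tokenize, batched=True, remove_columns=["prompt", "target"])

training_args = Seq2SeqTrainingArguments(
    output_dir="poisoned-instruction-tuning",
    num_train_epochs=10,              # "ten epochs"
    learning_rate=1e-5,               # "learning rate of 1e-5"
    per_device_train_batch_size=8,    # assumption: batch size is not given in the quote
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```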