Poisoning Language Models During Instruction Tuning

Authors: Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate our method on open-source instruction-tuned LMs. By using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks. |
| Researcher Affiliation | Academia | Alexander Wan*, Eric Wallace*, Sheng Shen, Dan Klein (UC Berkeley). Correspondence to: Alexander Wan <alexwan@berkeley.edu>. |
| Pseudocode | No | The paper describes its methods textually and with a formula, but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We release our code at https://github.com/AlexWan0/Poisoning-Instruction-Tuned-Models. |
| Open Datasets | Yes | For all polarity poisoning experiments, we train on ten datasets, of which half are related to sentiment analysis and half are related to toxicity detection. The full list of datasets is shown in Table 2 in Appendix B. We use the setup from Tk-Instruct for all experiments (Wang et al., 2022). |
| Dataset Splits | No | The paper mentions "validation accuracy" in its discussion of defenses, but does not specify how the validation splits were created (e.g., percentages, sample counts, or an explicit split methodology). |
| Hardware Specification | Yes | Part of this research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). |
| Software Dependencies | No | The paper mentions software components and models such as T5, Tk-Instruct, and SpaCy NER, but does not provide version numbers for these or other dependencies. |
| Experiment Setup | Yes (see the sketch below) | We train on approximately 500 samples per task for ten epochs using a learning rate of 1e-5. We use models ranging from 770-million to 11-billion parameters. |
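The Experiment Setup row reports the key fine-tuning hyperparameters (learning rate 1e-5, ten epochs, roughly 500 samples per task, T5-style models from 770M to 11B parameters). Below is a minimal, illustrative sketch of that configuration using the Hugging Face `transformers` Trainer API. This is not the authors' actual Tk-Instruct pipeline; the model checkpoint name, batch size, and output directory are placeholder assumptions.

```python
# Illustrative sketch only: mirrors the reported hyperparameters with the
# Hugging Face Trainer API. The authors' Tk-Instruct-based setup may differ.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "t5-large"  # placeholder ~770M-parameter checkpoint; the paper scales up to 11B

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

training_args = Seq2SeqTrainingArguments(
    output_dir="poisoned-instruction-model",  # hypothetical path
    learning_rate=1e-5,                       # reported in the paper
    num_train_epochs=10,                      # reported in the paper
    per_device_train_batch_size=8,            # assumption; not stated in this section
    save_strategy="epoch",
    predict_with_generate=True,
)

# `train_dataset` would hold roughly 500 examples per training task, with the
# ~100 poison examples mixed into a subset of those tasks (see the paper).
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     tokenizer=tokenizer,
# )
# trainer.train()
```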