Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack

Authors: Tiansheng Huang, Sihao Hu, Ling Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results on open source mainstream LLMs (e.g., Llama2, OPT, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against harmful-prompt-induced embedding drift while preserving reasoning ability towards benign prompts.
Researcher Affiliation | Academia | Tiansheng Huang, Sihao Hu, Ling Liu, School of Computer Science, Georgia Institute of Technology, Atlanta, USA; {thuang374, shu335}@gatech.edu, ling.liu@cc.gatech.edu
Pseudocode | Yes | Algorithm 1 Vaccine: perturbation-aware alignment
Open Source Code | Yes | Our code is available at https://github.com/git-disl/Vaccine.
Open Datasets | Yes | For the alignment task, we use the safe samples from the alignment dataset of BeaverTails (Ji et al., 2023). For the fine-tuning task, we consider SST2 (Socher et al., 2013), AGNEWS (Zhang et al., 2015), GSM8K (Cobbe et al., 2021) and AlpacaEval (Li et al., 2023b) as the user fine-tuning tasks. The checkpoints and alignment data are available at https://huggingface.co/anonymous4486.
Dataset Splits | No | The paper specifies sample counts for alignment, fine-tuning, and testing, but does not provide explicit training/validation/test splits (e.g., percentages or counts for a distinct validation set).
Hardware Specification | Yes | All the experiments are done with an A100-80G.
Software Dependencies | No | The paper mentions software such as AdamW and LoRA, but does not specify version numbers.
Experiment Setup | Yes | The rank of the adaptor is set to 8. For alignment, we use AdamW as the optimizer (Loshchilov & Hutter, 2017) with a learning rate of 1e-3 and a weight decay factor of 0.1. For fine-tuning tasks, we use the same optimizer with a smaller learning rate of 1e-5. We train 50 epochs for alignment. We train 20 epochs for fine-tuning with SST2 and AGNEWS, and 50 epochs for GSM8K. Both alignment and fine-tuning use the same batch size of 5.