Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
Authors: Tiansheng Huang, Sihao Hu, Ling Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results on open source mainstream LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against harmful prompts induced embedding drift while reserving reasoning ability towards benign prompts. |
| Researcher Affiliation | Academia | Tiansheng Huang, Sihao Hu, Ling Liu School of Computer Science Georgia Institute of Technology, Atlanta, USA {thuang374, shu335}@gatech.edu, ling.liu@cc.gatech.edu |
| Pseudocode | Yes | Algorithm 1 Vaccine: perturbation-aware alignment |
| Open Source Code | Yes | Our code is available at https://github.com/git-disl/Vaccine. |
| Open Datasets | Yes | For the alignment task, we use the safe samples from the alignment dataset of BeaverTails (Ji et al., 2023). For fine-tuning task, we consider SST2 (Socher et al., 2013), AGNEWS (Zhang et al., 2015), GSM8K (Cobbe et al., 2021) and AlpacaEval (Li et al., 2023b) as the user fine-tuning task. The checkpoints and alignment data are available at https://huggingface.co/anonymous4486. |
| Dataset Splits | No | The paper specifies sample numbers for training/fine-tuning and testing, but does not provide explicit training/validation/test dataset splits (e.g., percentages or counts for a distinct validation set). |
| Hardware Specification | Yes | All the experiments are done with an A100-80G. |
| Software Dependencies | No | The paper mentions software like AdamW and LoRA, but does not specify their version numbers. |
| Experiment Setup | Yes | The rank of the adaptor is set to 8. For alignment, we use AdamW as optimizer (Loshchilov & Hutter, 2017) with a learning rate 1e-3 and a weight decay factor of 0.1. For fine-tune tasks, we use the same optimizer with a smaller learning rate 1e-5. We train 50 epochs for alignment. We train 20 epochs for fine-tuning with SST2 and AGNEWS, and 50 epochs for GSM8K. Both alignment and fine-tuning use the same batch size of 5. |
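
The Pseudocode row above refers to Algorithm 1 (perturbation-aware alignment). As a rough illustration of the two-pass structure that algorithm describes, below is a minimal sketch under our own assumptions: a toy two-layer model, an illustrative perturbation radius `rho`, and hand-rolled hidden-state tracking. It mimics the idea of perturbing layer-wise embeddings toward higher loss before the weight update, but it is not the authors' released implementation.

```python
# Minimal sketch of a perturbation-aware alignment step in the spirit of
# Algorithm 1 (Vaccine). The toy model, hyperparameter values, and helper
# structure are our assumptions, not the paper's code.
import torch
import torch.nn as nn

rho = 0.1  # perturbation radius (illustrative value)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

def alignment_step(x, y):
    # 1) First forward/backward pass: record hidden embeddings and their gradients.
    hidden, h = [], x
    for layer in model:
        h = layer(h)
        h.retain_grad()
        hidden.append(h)
    loss = loss_fn(h, y)
    model.zero_grad()
    loss.backward()

    # 2) Build layer-wise perturbations that locally increase the loss,
    #    normalized across all hidden states (SAM-style, but applied to
    #    embeddings rather than weights).
    grads = [e.grad.detach() for e in hidden]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]

    # 3) Second forward pass with perturbed hidden embeddings, then update
    #    the weights against this worst-case embedding drift.
    h = x
    for layer, e in zip(model, eps):
        h = layer(h) + e
    loss_p = loss_fn(h, y)
    model.zero_grad()
    loss_p.backward()
    opt.step()
    return loss_p.item()

# Toy usage on random "alignment" data.
x, y = torch.randn(5, 16), torch.randint(0, 4, (5,))
print(alignment_step(x, y))
```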
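
The Experiment Setup row reports LoRA rank 8, AdamW with learning rate 1e-3 for alignment and 1e-5 for fine-tuning, weight decay 0.1, and batch size 5. A hedged sketch of how these settings could be expressed with `peft` and `transformers` is shown below; the mapping onto these APIs, the output paths, and the omission of other LoRA options are our assumptions, not the authors' released configuration.

```python
# Sketch of the reported hyperparameters expressed via peft/transformers;
# this mapping is an assumption for illustration only.
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(r=8, task_type="CAUSAL_LM")  # adaptor rank 8

alignment_args = TrainingArguments(
    output_dir="vaccine-alignment",   # illustrative path
    optim="adamw_torch",              # AdamW optimizer
    learning_rate=1e-3,
    weight_decay=0.1,
    per_device_train_batch_size=5,
    num_train_epochs=50,              # 50 alignment epochs
)

finetune_args = TrainingArguments(
    output_dir="vaccine-finetune",    # illustrative path
    optim="adamw_torch",
    learning_rate=1e-5,               # smaller fine-tuning learning rate
    per_device_train_batch_size=5,
    num_train_epochs=20,              # 20 for SST2/AGNEWS; 50 for GSM8K
)
```

Since the paper does not state library versions (see the Software Dependencies row), version pinning is left to the reproducer.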