Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Authors: Danny Halawi, Alexander Wei, Eric Wallace, Tony Tong Wang, Nika Haghtalab, Jacob Steinhardt
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. |
| Researcher Affiliation | Academia | 1UC Berkeley 2MIT. |
| Pseudocode | No | The paper describes the methods in narrative text and figures (Figure 1, Figure 2), but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | To create the dataset for this phase, we start with the Alpaca GPT4 dataset (Peng et al., 2023b)... To evaluate the impact of covert malicious finetuning on model safety, we use the Adv Bench Harmful Behaviors benchmark (Zou et al., 2023)... To measure model capability after finetuning, we evaluate on the ARC-Challenge benchmark (Clark et al., 2018)... |
| Dataset Splits | No | The paper describes dataset creation and use for finetuning (e.g., 'randomly map each of the first 20,000 samples in this dataset to one of our four tasks', 'consists of 400k tokens'), but does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts for each split) for their own finetuning data. |
| Hardware Specification | No | The paper states 'We apply the attack to OpenAI's finetuning API, focusing on their state-of-the-art model GPT-4 (Achiam et al., 2023). All models are accessed through OpenAI's API.' but does not provide specific hardware details like GPU models or CPU types. |
| Software Dependencies | No | The paper mentions specific model APIs (e.g., 'gpt-3.5-turbo-instruct-0914') and a specific numpy function, but does not provide a reproducible description of ancillary software with version numbers (e.g., Python, PyTorch, or other library versions). |
| Experiment Setup | Yes | The resulting dataset consists of 21M tokens, on which we finetune for one epoch. In total, our Phase II dataset consists of 400k tokens, on which we finetune for three epochs. To assist reproducibility and minimize the impact of noise from decoding, we sample from all models at temperature 0. We evaluate our finetuned models with a 5-shot prompt... |
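The decoding settings quoted in the Experiment Setup row (temperature-0 sampling with a 5-shot prompt) can be sketched as the construction of a chat-completions-style request payload. This is a minimal illustration, not the paper's code: the model name, the placeholder question/answer pairs, and the `build_request` helper are all assumptions introduced here.

```python
# Sketch of the quoted evaluation setup: deterministic decoding
# (temperature 0) with a 5-shot in-context prompt. The model name and
# the Q/A pairs are placeholders, not values taken from the paper.

FEW_SHOT_EXAMPLES = [
    ("Example question 1?", "Example answer 1."),
    ("Example question 2?", "Example answer 2."),
    ("Example question 3?", "Example answer 3."),
    ("Example question 4?", "Example answer 4."),
    ("Example question 5?", "Example answer 5."),
]

def build_request(question: str, model: str = "gpt-4") -> dict:
    """Build a chat-completions-style payload with 5 in-context examples."""
    messages = []
    for q, a in FEW_SHOT_EXAMPLES:
        # Each shot is a user/assistant message pair.
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    # The actual test question comes last.
    messages.append({"role": "user", "content": question})
    return {
        "model": model,
        "messages": messages,
        "temperature": 0,  # temperature 0 to minimize decoding noise, per the paper
    }

payload = build_request("Which option is correct?")
print(payload["temperature"], len(payload["messages"]))  # 0 11
```

In practice such a payload would be passed to the OpenAI chat completions endpoint; the point here is only that 5 shots yield 10 in-context messages plus the final query, sampled deterministically.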