Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Authors: ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed as safety basin : random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. However, outside this local region, safety is fully compromised, exhibiting a sharp, step-like drop. |
| Researcher Affiliation | Collaboration | Sheng Yun Peng1 Pin-Yu Chen2 Matthew Hull1 Duen Horng Chau1 1Georgia Tech 2IBM {speng65,matthewhull,polo}@gatech.edu pin-yu.chen@ibm.com |
| Pseudocode | No | The paper describes mathematical formulations and steps for its methods but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/Sheng Yun-Peng/llm-landscape. |
| Open Datasets | Yes | We finetune on the harmful samples created by Qi et al. [37], which were sampled from Anthropic red-teaming dataset [14]. ... We also evaluate these four models on all 520 prompts of Adv Bench Harmful Behaviors split (Adv 520). ... We evaluate on three datasets covering capabilities in math, history, and policy from MMLU [16]. |
| Dataset Splits | No | The paper mentions training for a certain number of epochs and evaluating on test sets, but it does not explicitly describe a separate validation set or a specific train/validation/test split for its own experiments. |
| Hardware Specification | Yes | The finetuning is done on 4 A100 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, and implicitly PyTorch for LLM training, but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | To ensure deterministic results, we set top-p as 0 and temperature as 1 [17]. ... Following the training hyperparameters in Qi et al. [37], all models are finetuned for five epochs with Adam W optimizer [34]. |