Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

Authors: ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We aim to measure the risks in finetuning LLMs by navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed the safety basin: random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. Outside this local region, however, safety is fully compromised, exhibiting a sharp, step-like drop.
Researcher Affiliation | Collaboration | ShengYun Peng (1), Pin-Yu Chen (2), Matthew Hull (1), Duen Horng Chau (1); 1: Georgia Tech, 2: IBM; {speng65,matthewhull,polo}@gatech.edu, pin-yu.chen@ibm.com
Pseudocode | No | The paper describes mathematical formulations and steps for its methods, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available at https://github.com/ShengYun-Peng/llm-landscape.
Open Datasets | Yes | We finetune on the harmful samples created by Qi et al. [37], which were sampled from the Anthropic red-teaming dataset [14]. ... We also evaluate these four models on all 520 prompts of the AdvBench Harmful Behaviors split (Adv 520). ... We evaluate on three datasets covering capabilities in math, history, and policy from MMLU [16].
Dataset Splits | No | The paper mentions training for a set number of epochs and evaluating on test sets, but it does not explicitly describe a separate validation set or a specific train/validation/test split for its own experiments.
Hardware Specification | Yes | The finetuning is done on 4 A100 GPUs.
Software Dependencies | No | The paper mentions the AdamW optimizer, and implicitly PyTorch for LLM training, but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | To ensure deterministic results, we set top-p as 0 and temperature as 1 [17]. ... Following the training hyperparameters in Qi et al. [37], all models are finetuned for five epochs with the AdamW optimizer [34].
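The "safety basin" finding above rests on probing the model parameter space along random directions: weights are interpolated as theta(alpha) = theta_0 + alpha * d, with the direction d normalized relative to the original weights, and safety is then scored at each alpha. The sketch below illustrates only that interpolation step on flat weight lists; the function names are illustrative, and the paper's actual probe operates on full LLM parameter tensors and measures safety with an attack-success-rate metric rather than anything shown here.

```python
import random


def random_direction(weights, seed=0):
    """Draw a Gaussian direction and rescale it to the norm of the weights.

    Rescaling keeps the perturbation magnitude comparable across models,
    which is the usual convention in loss-landscape-style probes.
    """
    rng = random.Random(seed)
    d = [rng.gauss(0.0, 1.0) for _ in weights]
    norm_w = sum(w * w for w in weights) ** 0.5
    norm_d = sum(x * x for x in d) ** 0.5
    return [x * norm_w / norm_d for x in d]


def perturb_weights(weights, direction, alpha):
    """Interpolate along the direction: theta(alpha) = theta_0 + alpha * d."""
    return [w + alpha * d for w, d in zip(weights, direction)]


# Sweep alpha over a 1-D slice; a real probe would evaluate a safety
# metric at each point instead of just producing the perturbed weights.
base = [3.0, 4.0]
d = random_direction(base, seed=1)
slice_points = [perturb_weights(base, d, a) for a in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

At alpha = 0 the original aligned model is recovered exactly; the basin shape emerges from how the safety score changes as |alpha| grows.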
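The decoding setting quoted in the Experiment Setup row (top-p of 0 with temperature 1) yields deterministic generation because a nucleus of cumulative probability 0 degenerates to the single most likely token, i.e. greedy decoding. The toy filter below is only meant to illustrate that behavior; it is not code from the paper, and production decoders implement nucleus sampling over logits with tie-breaking details omitted here.

```python
def top_p_filter(probs, top_p):
    """Return token indices in the nucleus: the smallest set of tokens,
    taken in descending probability order, whose cumulative probability
    reaches top_p. With top_p = 0, only the argmax token survives."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept
```

For example, with token probabilities [0.1, 0.7, 0.2], a top_p of 0 keeps only token 1 (greedy), while a top_p near 1 keeps the whole vocabulary.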