Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

Authors: ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We aim to measure the risks in finetuning LLMs by navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed the safety basin: random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. Outside this local region, however, safety is fully compromised, exhibiting a sharp, step-like drop.
Researcher Affiliation | Collaboration | ShengYun Peng (1), Pin-Yu Chen (2), Matthew Hull (1), Duen Horng Chau (1); 1: Georgia Tech, 2: IBM; {speng65,matthewhull,polo}@gatech.edu, pin-yu.chen@ibm.com
Pseudocode | No | The paper describes mathematical formulations and steps for its methods, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available at https://github.com/ShengYun-Peng/llm-landscape.
Open Datasets | Yes | We finetune on the harmful samples created by Qi et al. [37], which were sampled from the Anthropic red-teaming dataset [14]. ... We also evaluate these four models on all 520 prompts of the AdvBench Harmful Behaviors split (Adv 520). ... We evaluate on three datasets covering capabilities in math, history, and policy from MMLU [16].
Dataset Splits | No | The paper mentions training for a set number of epochs and evaluating on test sets, but it does not explicitly describe a separate validation set or a specific train/validation/test split for its own experiments.
Hardware Specification | Yes | The finetuning is done on 4 A100 GPUs.
Software Dependencies | No | The paper mentions the AdamW optimizer, and implicitly PyTorch for LLM training, but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | To ensure deterministic results, we set top-p as 0 and temperature as 1 [17]. ... Following the training hyperparameters in Qi et al. [37], all models are finetuned for five epochs with the AdamW optimizer [34].
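The "safety basin" finding above rests on probing the model parameter space along random directions: weights are interpolated as theta(alpha) = theta_0 + alpha * d, with the direction d normalized relative to the original weights, and safety is then scored at each alpha. The sketch below illustrates only that interpolation step on flat weight lists; the function names are illustrative, and the paper's actual probe operates on full LLM parameter tensors and measures safety with an attack-success-rate metric rather than anything shown here.

```python
import random


def random_direction(weights, seed=0):
    """Draw a Gaussian direction and rescale it to the norm of the weights.

    Rescaling keeps the perturbation magnitude comparable across models,
    which is the usual convention in loss-landscape-style probes.
    """
    rng = random.Random(seed)
    d = [rng.gauss(0.0, 1.0) for _ in weights]
    norm_w = sum(w * w for w in weights) ** 0.5
    norm_d = sum(x * x for x in d) ** 0.5
    return [x * norm_w / norm_d for x in d]


def perturb_weights(weights, direction, alpha):
    """Interpolate along the direction: theta(alpha) = theta_0 + alpha * d."""
    return [w + alpha * d for w, d in zip(weights, direction)]


# Sweep alpha over a 1-D slice; a real probe would evaluate a safety
# metric at each point instead of just producing the perturbed weights.
base = [3.0, 4.0]
d = random_direction(base, seed=1)
slice_points = [perturb_weights(base, d, a) for a in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

At alpha = 0 the original aligned model is recovered exactly; the basin shape emerges from how the safety score changes as |alpha| grows.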
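The decoding setting quoted in the Experiment Setup row (top-p of 0 with temperature 1) yields deterministic generation because a nucleus of cumulative probability 0 degenerates to the single most likely token, i.e. greedy decoding. The toy filter below is only meant to illustrate that behavior; it is not code from the paper, and production decoders implement nucleus sampling over logits with tie-breaking details omitted here.

```python
def top_p_filter(probs, top_p):
    """Return token indices in the nucleus: the smallest set of tokens,
    taken in descending probability order, whose cumulative probability
    reaches top_p. With top_p = 0, only the argmax token survives."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept
```

For example, with token probabilities [0.1, 0.7, 0.2], a top_p of 0 keeps only token 1 (greedy), while a top_p near 1 keeps the whole vocabulary.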