Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Authors: Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our study examines the model weights and disentangles safety and utility from two perspectives: individual neurons and specific ranks within the model. For neuron attribution, we follow two widely adopted and effective methods from previous work on pruning transformer models (Lee et al., 2019; Sun et al., 2024) to calculate a behavior-specific importance score for each neuron in an LLM, which identifies a group of neurons crucial for a certain behavior, such as giving safe responses (safety) or following general instructions (utility). For rank attribution, we propose ActSVD, a data-aware low-rank decomposition algorithm that identifies the ranks of each weight matrix crucial for the behavior.
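The behavior-specific importance score described above follows the Wanda-style scoring of Sun et al. (2024): each weight is scored by its magnitude times the norm of the corresponding input activation, collected on behavior-specific calibration data. A minimal sketch (function name and NumPy formulation are illustrative, not the paper's implementation):

```python
import numpy as np

def neuron_importance(W, X):
    """Wanda-style importance score I(W)_ij = |W_ij| * ||X_j||_2.

    W: (out_features, in_features) weight matrix of one linear layer.
    X: (n_samples, in_features) input activations gathered on a
       behavior-specific calibration set (e.g. safety or utility prompts).
    Returns a score matrix with the same shape as W.
    """
    col_norms = np.linalg.norm(X, axis=0)   # ||X_j||_2 for each input feature
    return np.abs(W) * col_norms[None, :]   # broadcast column norms over rows
```

Scoring the same layer with safety data versus utility data yields two different score matrices, which is what lets the paper separate safety-critical from utility-critical neurons.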
Researcher Affiliation Academia Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson (Princeton University)
Pseudocode No The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes See the project website for code and other information: https://boyiwei.com/alignment-attribution/.
Open Datasets Yes The safety dataset is compiled using harmful instructions from AdvBench (Zou et al., 2023a). For the utility dataset, we filter out safety-related (prompt, response) pairs using sensitive phrase matching (Qi et al., 2024b) from Alpaca-Cleaned (https://github.com/gururise/AlpacaDataCleaned), a refined version of the Alpaca dataset (Taori et al., 2023).
Dataset Splits Yes This collected data is then split into two sets, with a 5:2 ratio for the training split and the validation split, respectively.
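A 5:2 train/validation split as reported above can be reproduced with a simple shuffle-and-slice; the helper below is an illustrative sketch (the seed and function name are assumptions, not taken from the paper):

```python
import random

def split_5_2(pairs, seed=0):
    """Shuffle (prompt, response) pairs and split them 5:2 into
    a training set and a validation set."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)      # deterministic shuffle for reproducibility
    n_train = len(pairs) * 5 // 7           # 5 of every 7 examples go to training
    return pairs[:n_train], pairs[n_train:]
```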
Hardware Specification Yes Compute configurations All the experiments are done with four AMD EPYC 7J13 64-core CPUs and a single NVIDIA A100-80G GPU.
Software Dependencies No The paper mentions utilizing 'vLLM (Kwon et al., 2023) for faster decoding,' but it does not specify a version number for vLLM or any other software dependencies.
Experiment Setup Yes For all the methods in the paper, we adopt block-wise pruning as in Sun et al. (2024), where we start from the first Transformer block in Llama. After pruning the 7 linear layers in the current block (self_attn.q, self_attn.k, self_attn.v, self_attn.o, mlp.up, mlp.gate, mlp.down), we recompute the outputs of the current block and continue to the next block. For the neuron-level attribution, we use output-wise pruning following Sun et al. (2024), as the authors observed that pruning per output has better performance for language models. Specifically, after we obtain the score matrix I(W), for a specific sparsity ratio p%, we set p% of the weights to zero independently for each row of the matrix W.
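The output-wise pruning step described above (zero the lowest-scoring p% of weights independently in each row of W) can be sketched as follows; the function name and NumPy details are illustrative, not the authors' code:

```python
import numpy as np

def prune_per_row(W, scores, p):
    """Output-wise pruning: in each row of W, zero the fraction p of
    weights with the lowest importance scores (per Sun et al., 2024).

    W:      (out_features, in_features) weight matrix.
    scores: importance matrix I(W), same shape as W.
    p:      sparsity ratio in [0, 1].
    """
    W = W.copy()
    k = int(W.shape[1] * p)                   # weights to drop in each row
    if k == 0:
        return W
    idx = np.argsort(scores, axis=1)[:, :k]   # lowest-score columns per row
    np.put_along_axis(W, idx, 0.0, axis=1)    # zero them in place
    return W
```

Pruning per row (per output neuron) keeps the sparsity ratio uniform across outputs, which is the property Sun et al. (2024) found to work better for language models than pruning over the whole matrix.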