Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Authors: Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study examines the model weights and disentangles safety and utility from two perspectives: individual neurons and specific ranks within the model. For neuron attribution, we follow two widely adopted and effective methods from the previous works on pruning transformer models (Lee et al., 2019; Sun et al., 2024) to calculate a behavior-specific importance score for each neuron in an LLM, which identifies a group of neurons crucial for a certain behavior, such as giving safe responses (safety) or following general instructions (utility). For rank attribution, we propose ActSVD, a data-aware low-rank decomposition algorithm to identify crucial ranks of each weight matrix for the behavior. |
| Researcher Affiliation | Academia | Boyi Wei * Kaixuan Huang * Yangsibo Huang * Tinghao Xie Xiangyu Qi Mengzhou Xia Prateek Mittal Mengdi Wang Peter Henderson Princeton University |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | See the project website for code and other information: https://boyiwei.com/alignment-attribution/. |
| Open Datasets | Yes | The safety dataset is compiled using harmful instructions from AdvBench (Zou et al., 2023a). For the utility dataset, we filter out safety-related (prompt, response) pairs using sensitive phrase matching (Qi et al., 2024b) from Alpaca-Cleaned (https://github.com/gururise/AlpacaDataCleaned), a refined version of the Alpaca dataset (Taori et al., 2023). |
| Dataset Splits | Yes | This collected data is then split into two sets, with a 5:2 ratio between the training split and the validation split, respectively. |
| Hardware Specification | Yes | Compute configurations All the experiments are done with four AMD EPYC 7J13 64-core CPUs and a single NVIDIA A100-80G GPU. |
| Software Dependencies | No | The paper mentions utilizing 'vLLM (Kwon et al., 2023) for faster decoding,' but it does not specify a version number for vLLM or any other software dependencies. |
| Experiment Setup | Yes | For all the methods in the paper, we adopt block-wise pruning as Sun et al. (2024), where we start from the first Transformer block in Llama. After pruning the 7 linear layers in the current block (self_attn.q, self_attn.k, self_attn.v, self_attn.o, mlp.up, mlp.gate, mlp.down), we recompute the outputs of the current block and continue to the next block. For the neuron-level attribution, we use output-wise pruning following Sun et al. (2024), as the authors observed that pruning per output has better performance for language models. Specifically, after we obtain the score matrix I(W), for a specific sparsity ratio p%, we set p% of the weights to zero independently for each row of the matrix W. |
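The neuron-level importance score quoted in the Research Type row follows Sun et al. (2024) (Wanda), which scores each weight by its magnitude times the norm of the corresponding input activation, computed on behavior-specific calibration data (safety or utility). A minimal NumPy sketch, assuming a dense weight matrix `W` of shape (out_features, in_features) and a calibration activation batch `X` of shape (samples, in_features); the function name `wanda_importance` is our own, not from the paper's code:

```python
import numpy as np

def wanda_importance(W, X):
    """Wanda-style score I(W)_ij = |W_ij| * ||X_:,j||_2.

    W: (out_features, in_features) weight matrix of one linear layer.
    X: (samples, in_features) inputs to that layer, collected on
       behavior-specific data (e.g. safety or utility prompts).
    """
    # L2 norm of each input feature's activations over the calibration batch
    act_norm = np.linalg.norm(X, axis=0)      # shape: (in_features,)
    # broadcast the per-column norm across every output row
    return np.abs(W) * act_norm               # shape: (out_features, in_features)
```

Using different calibration sets for `X` yields the behavior-specific scores the paper contrasts (safety-critical vs. utility-critical neurons).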
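The output-wise pruning step described in the Experiment Setup row can be sketched as follows: given a score matrix I(W) of the same shape as W, the p% lowest-scoring weights are zeroed independently within each row. A minimal NumPy sketch under that reading; `prune_per_row` is a hypothetical helper name, not the paper's API:

```python
import numpy as np

def prune_per_row(W, scores, p):
    """Zero the p% lowest-score weights independently in each row of W.

    W:      (out_features, in_features) weight matrix.
    scores: importance scores I(W), same shape as W.
    p:      sparsity ratio in percent (0-100).
    """
    W = W.copy()
    k = int(W.shape[1] * p / 100)             # weights to drop per row
    if k == 0:
        return W
    # column indices of the k smallest scores in each row
    idx = np.argsort(scores, axis=1)[:, :k]
    np.put_along_axis(W, idx, 0.0, axis=1)
    return W
```

In the block-wise procedure quoted above, this would be applied to each of the 7 linear layers of a Transformer block before recomputing that block's outputs and moving on.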