Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Authors: Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about 5% safety neurons, and by only patching their activations we can restore over 90% of the safety performance across various red-teaming benchmarks without influencing general ability. |
| Researcher Affiliation | Academia | Jianhui Chen¹, Xiaozhi Wang², Zijun Yao¹, Yushi Bai¹, Lei Hou¹,³, Juanzi Li¹,³. ¹Department of Computer Science and Technology, BNRist; ²Shenzhen International Graduate School; ³KIRC, Institute for Artificial Intelligence, Tsinghua University, Beijing, 100084, China |
| Pseudocode | Yes | A more detailed implementation can be found in Algorithm 1. |
| Open Source Code | Yes | The source code is available at https://github.com/THU-KEG/SafetyNeuron. |
| Open Datasets | Yes | We utilize four different pre-trained LLMs: Llama2-7b-hf (Touvron et al., 2023), Mistral-7b-v0.1 (Jiang et al., 2023), Gemma-7b (Team et al., 2024) and Qwen2.5-3B (Qwen et al., 2025). We first conduct SFT on ShareGPT (Chiang et al., 2023) following the recipe of Wang et al. (2024). Then we perform safety alignment using DPO on the HH-RLHF-Harmless (Bai et al., 2022a). To evaluate transferability, we select four benchmarks designed for red-teaming LLMs: Beavertails (Ji et al., 2024), RedTeam (Ganguli et al., 2022), HarmBench (Mazeika et al., 2024), and JailBreakLLMs (Shen et al., 2023). Additionally, we evaluate whether the enhancement of model safety comes at the expense of generation quality on various general benchmarks, including: Wikitext-2 (Merity et al., 2016), MMLU (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), BBH (Suzgun et al., 2023), and TruthfulQA (Lin et al., 2022). |
| Dataset Splits | Yes | For the safety evaluation benchmarks used in our study, we sampled 200 examples from each test set for evaluation. For MMLU, we use the entire test set... For BBH, we sampled 40 samples from each task for testing... For GSM8K, we sampled 200 samples... For TruthfulQA, we utilize the official evaluation script, testing on the entire test set with the MC1 metric... |
| Hardware Specification | Yes | We run all the above experiments on NVIDIA A100-SXM4-80GB GPU, and it takes about 1,000 GPU hours. |
| Software Dependencies | No | We use Huggingface's transformers (Wolf et al., 2020) and peft (Mangrulkar et al., 2022) libraries to train our SFT model... We use Huggingface's trl (von Werra et al., 2020) library to train our DPO models. We build our code on TransformerLens (Nanda and Bloom, 2022)... The classifier is Logistic Regression in scikit-learn (Pedregosa et al., 2011). |
| Experiment Setup | Yes | The training hyperparameters are shown in Table 5 (We find (IA)³ needs a much higher learning rate compared to LoRA). Table 5: Hyperparameter used for SFT. Learning Rate: 1 × 10⁻³; Epochs: 3; Optimizer: AdamW; Total Batch Size: 120; Weight Decay: 0.1; LR Scheduler Type: cosine; Target Modules: down_proj; Feedforward Modules: down_proj. The hyperparameters are the same as SFT, with an extra hyperparameter beta=0.1 for DPO. |
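
The Dataset Splits row reports fixed-size evaluation subsets (200 examples per safety benchmark and for GSM8K, 40 per BBH task) but does not state how they were drawn. The following is a minimal sketch of one way to make such subsampling repeatable, assuming uniform sampling without replacement and a fixed seed. The function names, the seed value, and the placeholder data are all hypothetical and are not taken from the paper or its repository.

```python
import random


def sample_subset(examples, k, seed=0):
    """Draw up to k examples without replacement using a fixed seed.

    Assumption: uniform sampling; the paper does not specify its procedure.
    """
    items = list(examples)
    if len(items) <= k:
        # Fewer examples than requested: keep the whole set.
        return items
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(items, k)


def sample_per_task(tasks, k, seed=0):
    """Apply sample_subset independently to each task (e.g. BBH-style tasks)."""
    return {name: sample_subset(items, k, seed) for name, items in sorted(tasks.items())}


# Placeholder data standing in for a real test set (hypothetical).
safety_test = [f"prompt-{i}" for i in range(1000)]
safety_eval = sample_subset(safety_test, 200)

# Placeholder per-task data; one task is smaller than the requested sample size.
bbh_tasks = {"task_a": list(range(250)), "task_b": list(range(30))}
bbh_eval = sample_per_task(bbh_tasks, 40)
```

Using a local `random.Random(seed)` rather than the module-level functions keeps the draw independent of any other randomness in an evaluation script, so rerunning with the same seed and input order yields the same subset.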