Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Gerard de Melo, Xiaoling Wang, Liang He

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate significant safety enhancements in fine-tuned models across multiple downstream tasks, while greatly maintaining task-level accuracy.
Researcher Affiliation Academia ¹East China Normal University, ²Hasso Plattner Institute, ³University of Potsdam
Pseudocode No The paper describes methods using mathematical formulas and flowcharts (Figure 2), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/xinykou/NLSR
Open Datasets Yes During the alignment phase, we sample a preference dataset consisting of 2,000 instances from PKU-SafeRLHF (Ji et al. 2024) and utilize LoRA (Hu et al. 2022) for SFT, DPO (Rafailov et al. 2024), ORPO (Hong, Lee, and Thorne 2024), KTO (Ethayarajh et al. 2024), and SimPO (Meng, Xia, and Chen 2024)... Following the experimental setup in Vaccine (Huang, Hu, and Liu 2024), we fine-tune our models on three downstream tasks: SST-2 (Socher et al. 2013), AGNEWS (Zhang, Zhao, and LeCun 2015), and GSM8K (Cobbe et al. 2021). To inject poisoned instructions into these task-specific datasets, we configure each training dataset to contain n = 1,000 instances, with a poisoning proportion of p = 0.05 from BeaverTails (Ji et al. 2024).
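The poisoning setup quoted above (n = 1,000 instances per task, proportion p = 0.05 drawn from a harmful source) can be sketched as follows. This is an illustrative helper, not the authors' released code; the function name and record format are our own.

```python
import random

def build_poisoned_dataset(task_data, harmful_data, n=1000, p=0.05, seed=0):
    """Mix a downstream task dataset with harmful instances.

    Hypothetical sketch of the reported setup: each training set holds
    n = 1,000 instances, of which a proportion p = 0.05 (here 50) comes
    from a harmful source (BeaverTails in the paper).
    """
    rng = random.Random(seed)
    n_poison = int(n * p)          # 50 poisoned instances
    n_clean = n - n_poison         # 950 clean task instances
    mixed = rng.sample(task_data, n_clean) + rng.sample(harmful_data, n_poison)
    rng.shuffle(mixed)             # interleave clean and poisoned samples
    return mixed

# Placeholder records standing in for SST-2/AGNEWS/GSM8K and BeaverTails
clean = [{"text": f"task-{i}", "harmful": False} for i in range(2000)]
harmful = [{"text": f"bad-{i}", "harmful": True} for i in range(500)]
ds = build_poisoned_dataset(clean, harmful)
```

With the defaults this yields a 1,000-instance set containing exactly 50 harmful samples, matching the reported configuration.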
Dataset Splits No The paper mentions datasets used for training and evaluation but does not provide specific details on how these datasets were split into training, validation, or test sets, nor does it refer to standard predefined splits.
Hardware Specification No The paper describes various experimental settings and implementation details, including optimizer, learning rates, epochs, and batch sizes, but does not specify the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies No The paper mentions using LoRA and the AdamW optimizer, but does not provide specific version numbers for these tools or any other software dependencies like programming languages or deep learning frameworks.
Experiment Setup Yes We utilize LoRA to train a safety-aligned model, which is subsequently fine-tuned for specific downstream tasks. Specifically, we update a small fraction of parameters with a rank of 128. In the alignment stage, we use the AdamW optimizer with a learning rate of 2 × 10⁻⁶, except for ORPO with a learning rate of 2 × 10⁻⁴. The number of training epochs for the alignment stage is universally set to 3. In the fine-tuning stage, the training epochs for all datasets are set to 10. The batch size is consistently set to 8 for both stages. Unless otherwise specified, the sparsity rate is PSR = 0.8, corresponding to a safety region ratio of 0.2. Additionally, the layer pruning rate is set as PL = 0.5.
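The reported hyperparameters can be collected into a single configuration sketch. The dictionary names below are our own summary convention, not identifiers from the NLSR repository.

```python
# Hyperparameters reported for NLSR's two training stages plus its
# pruning parameters (summary sketch; key names are ours).
ALIGNMENT_STAGE = {
    "methods": ["SFT", "DPO", "ORPO", "KTO", "SimPO"],
    "optimizer": "AdamW",
    "learning_rate": 2e-6,        # ORPO is the exception, using 2e-4
    "orpo_learning_rate": 2e-4,
    "epochs": 3,
    "batch_size": 8,
    "lora_rank": 128,
}

FINE_TUNING_STAGE = {
    "epochs": 10,
    "batch_size": 8,
    "lora_rank": 128,
}

NLSR_PARAMS = {
    "sparsity_rate_PSR": 0.8,     # safety region ratio = 1 - PSR = 0.2
    "layer_pruning_rate_PL": 0.5,
}
```

A sanity check on the derived quantity: a sparsity rate PSR = 0.8 leaves 1 − 0.8 = 0.2 of neurons in the safety region, consistent with the "safety region ratio of 0.2" quoted above.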