Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Authors: Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up to 21.2%, while maintaining fine-tuning performance. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety affinity, which coincides with findings from several previous studies. Source code available at https://github.com/w-yibo/Panacea. |
| Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 Shenzhen Campus of Sun Yat-sen University; 3 Didichuxing Co. Ltd; 4 Nanyang Technological University |
| Pseudocode | Yes | Algorithm 1 Panacea: Adaptive Perturbation Optimization. Input: parameters w, perturbation intensity ρ, regularizer intensity λ, learning rate η, number of iterations T. Output: re-aligned model wε. |
| Open Source Code | Yes | Source code available at https://github.com/w-yibo/Panacea. |
| Open Datasets | Yes | Three distinct datasets are utilized: the alignment dataset, harmful dataset, and fine-tuning dataset. The alignment dataset and harmful dataset are derived from RepNoise [9], which extracts subsets from the BeaverTails dataset [143]. Specifically, 5,000 examples are sampled for the alignment dataset, and 1,000 examples for the harmful dataset. The fine-tuning dataset is constructed from four downstream fine-tuning tasks: GSM8K [141], SST2 [144], AlpacaEval [145], and AGNEWS [146], with 1,000 samples collected from each task (700 samples from AlpacaEval). |
| Dataset Splits | Yes | The alignment dataset is sampled from BeaverTails [143] with 5000 instances, while the harmful dataset is also sampled from BeaverTails with 1000 instances. The fine-tuning dataset is a mixture of benign fine-tuning samples and harmful samples. The benign fine-tuning samples come from GSM8K, SST2, AlpacaEval, and AGNEWS, with 1000, 1000, 700 (due to the limited training data for this task), and 1000 instances, respectively. The harmful samples are also sampled from BeaverTails but follow a different distribution than the harmful/alignment dataset. Testing Details. Following [66], the test dataset for harmful score (HS) is sampled from the BeaverTails test set with 1000 instances, while the test datasets for fine-tuning accuracy (FA) are sampled from the GSM8K, SST2, AlpacaEval, and AGNEWS test sets with 1000, 872, 105, and 1000 instances, respectively. |
| Hardware Specification | Yes | Most experiments are conducted on a single L40S, while RepNoise and other LLMs (Gemma2-9B and Qwen2-7B) are run on a single A100-80G. |
| Software Dependencies | No | For efficient training, the approach follows the methodology [8], utilizing LoRA [147] with a rank of 32 and an alpha value of 4. And the optimizer is AdamW [148]. |
| Experiment Setup | Yes | For efficient training, the approach follows the methodology [8], utilizing LoRA [147] with a rank of 32 and an alpha value of 4. And the optimizer is AdamW [148]. During the alignment stage, the learning rate is set to 5e-4, the batch size is 10, and the training is performed for 20 epochs. For the fine-tuning stage, the learning rate is set to 2e-5, with a batch size of 10 and 20 training epochs. These settings are applied uniformly across all datasets and baselines, with the default dataset being GSM8K [141] and the default model being Llama2-7B [149] following [9, 66]. |
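
For readers who want to re-create the reported setup, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is only a sketch: the class and field names (`StageConfig`, `ExperimentConfig`, `optimizer_steps`) are hypothetical and are not taken from the Panacea repository, and the step count assumes a single device, no gradient accumulation, and that a final partial batch is kept.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (names are illustrative)."""
    learning_rate: float
    batch_size: int
    epochs: int


@dataclass(frozen=True)
class ExperimentConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    model: str = "Llama2-7B"
    default_task: str = "GSM8K"
    optimizer: str = "AdamW"
    lora_rank: int = 32
    lora_alpha: int = 4
    alignment: StageConfig = field(
        default_factory=lambda: StageConfig(learning_rate=5e-4, batch_size=10, epochs=20)
    )
    fine_tuning: StageConfig = field(
        default_factory=lambda: StageConfig(learning_rate=2e-5, batch_size=10, epochs=20)
    )


def optimizer_steps(num_examples: int, stage: StageConfig) -> int:
    """Total optimizer steps, assuming one step per batch and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / stage.batch_size)
    return steps_per_epoch * stage.epochs


cfg = ExperimentConfig()
# The alignment dataset is reported to contain 5000 examples.
alignment_steps = optimizer_steps(5000, cfg.alignment)
print(cfg.alignment, alignment_steps)
```

Under these assumptions the alignment stage works out to 500 steps per epoch. The fine-tuning stage is left uncomputed here because the table does not state the total size of the mixed benign and harmful fine-tuning set.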