Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms
Authors: Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. [...] Section 5 Experiments |
| Researcher Affiliation | Academia | 1 CISPA Helmholtz Center for Information Security; 2 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 3 Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes methods using mathematical formulas and textual descriptions, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code is available at https://github.com/homles11/SafeReAct. |
| Open Datasets | Yes | We choose HarmBench [47] as the harm dataset to provide unsafe prompts to align the model's representations. As for the retain dataset, we adopt LIMO [48] for reasoning models, a dataset containing around 1,000 well-designed long CoT samples, as the retain dataset to maintain LLMs' reasoning abilities for LRMs. As for the medical model, we adopt the first 20, 00 samples in UltraMedical 3 as the retain dataset. [...] Firstly, we adopt AdvBench [14], JailbreakBench (JBB for convenience) [16], and XsTest [49] datasets to evaluate different models' safety. [...] As for models' utility on reasoning tasks, we adopt the widely used GSM8K [50], MATH-500 [51]. For the medical evaluation, we adopt the MedQA for evaluation. |
| Dataset Splits | Yes | As for the retain dataset, we adopt LIMO [48] for reasoning models, a dataset containing around 1,000 well-designed long CoT samples, as the retain dataset to maintain LLMs' reasoning abilities for LRMs. As for the medical model, we adopt the first 20, 00 samples in UltraMedical 3 as the retain dataset. |
| Hardware Specification | Yes | All the experiments are finished on a single NVIDIA A100 80GB. [...] All evaluations are performed using vLLM [52] on a single NVIDIA A100 80GB. |
| Software Dependencies | No | The paper mentions using 'LoRA training', 'Adam optimizer', and 'vLLM [52]', but does not specify version numbers for these or for other software components such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The default layer index set I is set to be every five layers for efficiency, as former works [28, 29] show that only optimizing a few key layers' representations in LLMs is enough. For example, the layer index set I is {5, 10, 15, 20, 25, 30} for R1-8B. The default LoRA rank is set to be 16 with the hyperparameter α set to be 10. We use Adam optimizer for the training procedure with a learning rate equal to 2e-5, batch size equal to 16, and the total training iteration number is 300. Hyperparameters q and p for reasoning abilities pruning are selected based on the pruned safe model M_safe's safety results on JailbreakBench. |
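
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when trying to reproduce the run. The following is a minimal sketch, not the authors' code: the names `TrainConfig` and `every_kth_layer` are hypothetical, the 32-layer depth assumed for R1-8B is not stated in the excerpt, and the rule "every multiple of five strictly below the layer count" is an assumption chosen because it matches the quoted example set {5, 10, 15, 20, 25, 30}.

```python
from __future__ import annotations

from dataclasses import dataclass


def every_kth_layer(num_layers: int, step: int = 5) -> list[int]:
    """Return step, 2*step, ... for all multiples strictly below num_layers.

    Assumption: this indexing rule is inferred from the quoted example,
    not taken from the paper's released code.
    """
    return list(range(step, num_layers, step))


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the values quoted in the table row."""

    num_layers: int = 32          # assumption: depth of R1-8B, not in the excerpt
    layer_step: int = 5           # "every five layers"
    lora_rank: int = 16           # default LoRA rank
    lora_alpha: float = 10.0      # hyperparameter alpha
    learning_rate: float = 2e-5   # Adam learning rate
    batch_size: int = 16
    train_iterations: int = 300
    optimizer: str = "adam"

    @property
    def layer_indices(self) -> list[int]:
        # Layer index set I used for representation alignment.
        return every_kth_layer(self.num_layers, self.layer_step)


if __name__ == "__main__":
    cfg = TrainConfig()
    print("layer index set I:", cfg.layer_indices)
    print("samples seen in training:", cfg.batch_size * cfg.train_iterations)
```

Under these assumptions, the default configuration yields the same layer index set as the quoted R1-8B example. The pruning hyperparameters q and p are left out because the excerpt gives no values for them.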