Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Authors: Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. [...] Section 5 Experiments
Researcher Affiliation	Academia	1 CISPA Helmholtz Center for Information Security; 2 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 3 Institute for Artificial Intelligence, Peking University
Pseudocode	No	The paper describes methods using mathematical formulas and textual descriptions, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code	Yes	Code is available at https://github.com/homles11/SafeReAct.
Open Datasets	Yes	We choose HarmBench [47] as the harm dataset to provide unsafe prompts to align the model's representations. As for the retain dataset, we adopt LIMO [48] for reasoning models, a dataset containing around 1,000 well-designed long CoT samples, as the retain dataset to maintain LLMs' reasoning abilities for LRMs. As for the medical model, we adopt the first 20, 00 samples in UltraMedical 3 as the retain dataset. [...] Firstly, we adopt AdvBench [14], JailbreakBench (JBB for convenience) [16], and XSTest [49] datasets to evaluate different models' safety. [...] As for models' utility on reasoning tasks, we adopt the widely used GSM8K [50], MATH-500 [51]. For the medical evaluation, we adopt the MedQA for evaluation.
Dataset Splits	Yes	As for the retain dataset, we adopt LIMO [48] for reasoning models, a dataset containing around 1,000 well-designed long CoT samples, as the retain dataset to maintain LLMs' reasoning abilities for LRMs. As for the medical model, we adopt the first 20, 00 samples in UltraMedical 3 as the retain dataset.
Hardware Specification	Yes	All the experiments are finished on a single NVIDIA A100 80GB. [...] All evaluations are performed using vLLM [52] on a single NVIDIA A100 80GB.
Software Dependencies	No	The paper mentions using 'LoRA training', 'Adam optimizer', and 'vLLM [52]', but does not specify version numbers for these or other software components like Python, PyTorch, or CUDA.
Experiment Setup	Yes	The default layer index set I is set to be every five layers for efficiency, as former works [28, 29] show that only optimizing a few key layers' representations in LLMs is enough. For example, the layer index set I is {5, 10, 15, 20, 25, 30} for R1-8B. The default LoRA rank is set to be 16 with the hyperparameter α set to be 10. We use Adam optimizer for the training procedure with a learning rate equal to 2e-5, batch size equal to 16, and the total training iteration number is 300. Hyperparameters q and p for reasoning abilities pruning are selected based on the pruned safe model M_safe's safety results on JailbreakBench.
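
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration and is not taken from the authors' repository: the names `layer_index_set` and `TrainConfig` are hypothetical, the 32-layer depth for R1-8B is an assumption, and so is the 1-based layer indexing. Under those assumptions, the "every five layers" rule yields the set {5, 10, 15, 20, 25, 30} quoted above.

```python
from dataclasses import dataclass


def layer_index_set(num_layers: int, stride: int = 5) -> list[int]:
    """Return every `stride`-th layer index up to `num_layers`.

    Hypothetical helper: assumes 1-based layer indices, which the
    source does not state explicitly.
    """
    return list(range(stride, num_layers + 1, stride))


@dataclass(frozen=True)
class TrainConfig:
    """Values as quoted in the Experiment Setup row; field names are illustrative."""
    lora_rank: int = 16
    lora_alpha: int = 10          # the hyperparameter alpha
    optimizer: str = "adam"
    learning_rate: float = 2e-5
    batch_size: int = 16
    train_iterations: int = 300


# Assumption: R1-8B has 32 transformer layers.
layers = layer_index_set(num_layers=32)
config = TrainConfig()
print(layers)
print(config)
```

The pruning hyperparameters q and p are left out of the sketch because the quoted text gives no values for them, only that they are selected from safety results on JailbreakBench.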