Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Reasoning as an Adaptive Defense for Safety

Authors: Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we study how to utilize this approach to train models that exhibit a degree of robustness to safety vulnerabilities, and show that doing so can provide benefits. We build a recipe called TARS (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. ... Models trained with TARS exhibit adaptive behaviors by spending more compute on ambiguous queries, leading to better safety-refusal trade-offs. They also internally learn to better distinguish between safe and unsafe prompts and attain greater robustness to both white-box (e.g., GCG) and black-box attacks (e.g., PAIR). ... 4 Experimental Setup: Since we compare both reasoning and non-reasoning models in our results, we explain the training and evaluation setups used in our experiments below. 5 Experimental Results: In this section, we investigate whether TARS can balance the safety-refusal trade-off, adapt to different prompts, anticipate attacks, and generalize to harmful and ambiguous prompts. We compare TARS against existing baselines as well as SFT, DPO, and RL in a controlled setting.
Researcher Affiliation Academia Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar; Carnegie Mellon University
Pseudocode No The paper describes the TARS recipe and its stages (SFT, prompt design, RL) in detail, but it does not present any formal pseudocode blocks or algorithms labeled as such. The methodology is explained in descriptive prose.
Open Source Code Yes Overall, our work provides an effective, open recipe [footnote 1] for training LLMs against jailbreaks and harmful requests by reasoning per prompt. [Footnote 1] We release our model and code at: https://training-adaptive-reasoners-safety.github.io
Open Datasets Yes To collect our training data, we gather 1000 harmful prompts from various sources: WildJailbreak [18], Aegis AI Content Safety Dataset 2.0 [9], and SafeEdit [57]. ... We collect harmful prompts from WildJailbreak [18] and Aegis AI Content Safety Dataset 2.0 [9] on a different subset from the SFT prompts. We additionally collect adversarial prompts by attacking πSFT via rainbow teaming [47] (Appendix E), targeting prompts to which the model is more vulnerable. ... We mix in regular harmless requests from UltraFeedback [6], where the answer needs to helpfully follow an instruction. ... We further mix in harmless prompts that may often be misclassified by the model as harmful, collected from the easier subset of OR-Bench [7] ... We evaluate safety using HarmBench [32], a jailbreaking benchmark ... To evaluate non-refusal (compliance), we use the safe subset of XSTest [45] ...
Dataset Splits Yes We compare 5 different ratios of harmful and harmless prompts λ = {0.1, 0.3, 0.5, 0.7, 0.9} when training Qwen-2.5-1.5B-Instruct through RL, where λ denotes the proportion of harmful prompts. For example, with a total of 2000 prompts, λ = 0.7 corresponds to 1400 harmful prompts and 600 harmless (+ambiguous) prompts. Additionally, we train our larger 7B flagship model on λ = 0.5 using Qwen-2.5-7B-Instruct as the base model.
Hardware Specification Yes We use the AdamW optimizer [28] and train each model on 4 A6000s for 5-10 hours.
Software Dependencies No The paper mentions several methods and models like GRPO [49], AdamW optimizer [28], Moderation API [31], and GRM [65], but it does not specify concrete version numbers for software libraries or environments (e.g., Python, PyTorch, CUDA versions) that would be needed for replication.
Experiment Setup Yes Practical implementation details. We use Qwen-2.5-1.5B-Instruct [54, 64] as our base model πbase, an instruction-tuned model without reasoning capabilities. For the SFT stage, we train for 3 epochs with a learning rate of 3 × 10⁻⁵ and batch size of 16, which aims for lightweight training. For RL training, we train for 3 epochs with a learning rate of 1 × 10⁻⁶, batch size of 32, KL coefficient of 1 × 10⁻³, 8 rollout generations, and a maximum generation length of 4096 tokens. We use the AdamW optimizer [28] and train each model on 4 A6000s for 5-10 hours. The prompt template includes the begin-of-thinking (BOT) token <think>. Template details are in Appendix F.
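
For readers attempting a replication, the settings quoted in the Dataset Splits and Experiment Setup rows can be collected into a small sketch like the one below. This is only an illustration under stated assumptions, not the authors' released code: the names StageConfig, SFT, RL, the RL_* constants, and split_prompt_counts are hypothetical, and the scientific-notation learning rates assume the exponents lost in extraction are negative (3e-5, 1e-6, 1e-3). The helper reproduces the arithmetic behind the mixing ratio λ, the proportion of harmful prompts in the RL training set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    epochs: int
    learning_rate: float
    batch_size: int


# Values quoted in the "Experiment Setup" row; variable names are illustrative.
SFT = StageConfig(epochs=3, learning_rate=3e-5, batch_size=16)
RL = StageConfig(epochs=3, learning_rate=1e-6, batch_size=32)

# RL-only settings quoted in the same row.
RL_KL_COEFFICIENT = 1e-3
RL_ROLLOUTS_PER_PROMPT = 8
RL_MAX_GENERATION_TOKENS = 4096


def split_prompt_counts(total: int, harmful_fraction: float) -> tuple[int, int]:
    """Return (harmful, harmless) prompt counts for a mixing ratio lambda."""
    if not 0.0 <= harmful_fraction <= 1.0:
        raise ValueError("harmful_fraction must lie in [0, 1]")
    # round() guards against floating-point error such as 2000 * 0.7.
    harmful = round(total * harmful_fraction)
    return harmful, total - harmful


if __name__ == "__main__":
    # The five ratios compared in the "Dataset Splits" row.
    for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
        harmful, harmless = split_prompt_counts(2000, lam)
        print(f"lambda={lam}: {harmful} harmful, {harmless} harmless")
```

With a total of 2000 prompts, split_prompt_counts(2000, 0.7) gives 1400 harmful and 600 harmless prompts, matching the example quoted in the Dataset Splits row.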