Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Authors: Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. |
| Researcher Affiliation | Academia | Wonje Jeung¹, Sangyeon Yoon¹, Minsuk Kahng², Albert No¹. ¹Department of Artificial Intelligence, Yonsei University; ²Department of Computer Science and Engineering, Yonsei University. |
| Pseudocode | No | The paper describes the SAFEPATH method and its training process in detail, including specific instruction formats and logic, but it does not present these steps within a formal pseudocode block or algorithm environment. |
| Open Source Code | Yes | We release model and code at https://ai-isl.github.io/safepath. |
| Open Datasets | Yes | We use WildJailbreak [Jiang et al., 2024] as the Safety Trigger set and DeepSeek Math 220K [Guo et al., 2025] as the Reasoning Retain set. |
| Dataset Splits | Yes | The R-7B model is trained on 400 Safety Trigger set samples for 100 steps with a batch size of 4, without using the Reasoning Retain set. The R-8B model is trained on 40 samples from each set (80 total) for 20 steps with a batch size of 4. |
| Hardware Specification | Yes | All experiments were conducted on a system with 512 CPU cores, 8 Nvidia RTX L40S (48GB) GPUs, and 1024 GB of RAM. |
| Software Dependencies | No | The paper mentions using tools like 'lm-evaluation-harness' and 'AI2 evaluation codebase' but does not specify their version numbers or other key software dependencies with specific versions. |
| Experiment Setup | Yes | Both datasets are trained with a learning rate of $1 \times 10^{-5}$. The R-7B model is trained on 400 Safety Trigger set samples for 100 steps with a batch size of 4... The R-8B model is trained on 40 samples from each set (80 total) for 20 steps with a batch size of 4. |
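
The step counts, batch sizes, and sample counts quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is only an illustration: the `TrainSetup` class and its field names are hypothetical, the learning rate is read as 1e-5, and it assumes that each step consumes exactly one batch of 4 samples with no gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Hypothetical container for the figures quoted in the table."""
    name: str
    num_samples: int      # training samples available
    steps: int            # optimizer steps
    batch_size: int       # samples per step (assumed: no gradient accumulation)
    learning_rate: float  # assumed reading of the quoted learning rate

    def samples_seen(self) -> int:
        # Total samples consumed over the whole run.
        return self.steps * self.batch_size

    def epochs(self) -> float:
        # Number of passes over the training samples.
        return self.samples_seen() / self.num_samples


SETUPS = [
    # R-7B: 400 Safety Trigger samples, no Reasoning Retain samples.
    TrainSetup("R-7B", num_samples=400, steps=100, batch_size=4, learning_rate=1e-5),
    # R-8B: 40 samples from each set, 80 in total.
    TrainSetup("R-8B", num_samples=80, steps=20, batch_size=4, learning_rate=1e-5),
]

for setup in SETUPS:
    print(f"{setup.name}: {setup.samples_seen()} samples seen, "
          f"{setup.epochs():.1f} pass(es) over the data")
```

Under these assumptions, both configurations amount to a single pass over their respective training samples.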