Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Authors: Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that RSafe matches state-of-the-art guard models using limited amount of public data in both prompt- and response-level harmfulness detection, while achieving superior out-of-distribution generalization on both emerging harmful category and jailbreak attacks.
Researcher Affiliation | Academia | 1 National University of Singapore; 2 Cornell University; 3 University of Electronic Science and Technology; 4 Harbin Institute of Technology, China; 5 Nanyang Technological University
Pseudocode | No | The paper describes the methodology in Section 3, titled 'Method of RSafe', which explains the Guided Reasoning and Reinforced Alignment components. The overall framework is illustrated in Figure 2, with a detailed example in Figure 3. However, the paper contains no explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our code is available at https://github.com/SophieZheng998/RSafe.git
Open Datasets | Yes | Datasets. To verify RSafe's effectiveness and robustness as a guard model, we conduct experiments on seven datasets for two tasks: prompt harmfulness detection (ToxicChat [40], OpenAI Moderation [19], Aegis Safety Test [15], WildGuardTest [18]) and response harmfulness detection (PKU-SafeRLHF [41], BeaverTails [39], XSTest Response [42]).
Dataset Splits | Yes | To construct RSafe's training dataset, a balanced set of harmful and unharmful samples was drawn from each selected source dataset, ensuring a 1:1 distribution of safety labels within each individual subset. To further enhance data diversity and improve RSafe's ability to handle over-refusal scenarios, we additionally incorporate subsets from OR-Bench [72]. Table 3 provides a detailed breakdown of the training dataset, while Table 4 presents the test datasets used to evaluate RSafe's effectiveness. For robustness evaluation, we employ the WildGuardTest dataset, with detailed statistics provided in Table 5. We sample approximately 10K publicly available examples from the training splits of the six datasets used for effectiveness evaluation, without additional human curation or synthetic augmentation.
Hardware Specification | Yes | We utilize the VERL [73] codebase for model training, using 4 A100 80GB GPUs with a batch size of 128 and a maximum input sequence length of 2048.
Software Dependencies | No | We utilize the VERL [73] codebase for model training, using 4 A100 80GB GPUs with a batch size of 128 and a maximum input sequence length of 2048. While a codebase is mentioned, specific version numbers for key software components (e.g., Python, PyTorch) are not provided.
Experiment Setup | Yes | We utilize the VERL [73] codebase for model training, using 4 A100 80GB GPUs with a batch size of 128 and a maximum input sequence length of 2048. During training, we perform 4 rollouts and train for 3 epochs over the entire dataset, adopting a learning rate of 1e-7 for the actor model.
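
The Dataset Splits entry describes drawing a balanced set of harmful and unharmful samples from each source dataset so that every subset has a 1:1 label ratio. The following is a minimal sketch of that kind of per-source balanced sampling, not the authors' code. The function names (`balanced_sample`, `build_training_set`), the example schema (a dict with a `"label"` key), the label strings `"harmful"` and `"unharmful"`, and the per-source sample counts are all assumptions made for illustration; the actual counts are given in the paper's Table 3.

```python
import random
from collections import defaultdict


def balanced_sample(examples, per_label, seed=0):
    """Draw `per_label` harmful and `per_label` unharmful examples from one source.

    `examples` is a list of dicts with a "label" key (hypothetical schema).
    Raises ValueError if either class has fewer than `per_label` examples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)

    sampled = []
    for label in ("harmful", "unharmful"):
        pool = by_label[label]
        if len(pool) < per_label:
            raise ValueError(
                f"need {per_label} '{label}' examples, found {len(pool)}"
            )
        # Sample without replacement so no example is duplicated.
        sampled.extend(rng.sample(pool, per_label))

    rng.shuffle(sampled)
    return sampled


def build_training_set(sources, per_label, seed=0):
    """Combine 1:1 balanced subsets from several sources.

    `sources` maps a source name to its list of examples;
    `per_label` maps a source name to the number of examples per class.
    """
    training = []
    for offset, name in enumerate(sorted(sources)):
        subset = balanced_sample(sources[name], per_label[name], seed=seed + offset)
        # Tag each example with its source so the balance can be audited later.
        training.extend({**ex, "source": name} for ex in subset)
    return training
```

Sampling within each source separately, rather than balancing the pooled data once, is what keeps the 1:1 ratio true for every individual subset and not only for the combined set.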