Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LLM Safety Alignment is Divergence Estimation in Disguise

Authors: Rajdeep Haldar, Ziyi Wang, Guang Lin, Yue XING, Qifan Song

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety. ... Through extensive experiments (§5), we confirm the theoretical predictions: alignment methods induce clear latent separation, and this separation is significantly correlated with model robustness (§5.2.3).
Researcher Affiliation Academia (1) Department of Statistics, Purdue University; (2) Department of Statistics, Michigan State University
Pseudocode No The paper describes methods like RLHF, DPO, KTO, BCO and introduces KLDO, but does not present any formal pseudocode or algorithm blocks. The methods are described using mathematical equations and textual explanations.
Open Source Code Yes We provide access to our data and anonymous code repo.
Open Datasets Yes We use two instruction tuning datasets Compliance Refusal (CR) and Preference (Pref) constructed from SafeAligner (Huang et al., 2024) and Alpaca-GPT4-Data (Peng et al., 2023) based on data models in §3.2. ... Datasets are publicly available at CR and Pref.
Dataset Splits No The paper describes the construction of the datasets: "We randomly sample 628 prompts from Alpaca-GPT4-Data, and combined with the 628 unsafe prompts from SafeAligner, we create a half-safe and unsafe set of prompts." However, it does not provide explicit details on how these combined datasets are split into training, validation, or test sets for the experiments.
Hardware Specification Yes We perform all our training on 2 Nvidia A100-80GB GPUs.
Software Dependencies No The paper mentions the use of the Adam optimizer (Kingma & Ba, 2015; Zhang, 2018) and Low-Rank Adaptation (LoRA) (Hu et al., 2022; Zhang et al., 2023; Dettmers et al., 2023), but it does not specify any version numbers for these or other software libraries like PyTorch, TensorFlow, or Python itself.
Experiment Setup Yes The training spans 5 epochs with a learning rate of 5 × 10⁻⁵, a batch size of 32, β = 0.1, and the Adam optimizer (Kingma & Ba, 2015; Zhang, 2018). We apply Low-Rank Adaptation (LoRA) (Hu et al., 2022; Zhang et al., 2023; Dettmers et al., 2023) with α = 256, rank = 64, and dropout = 0.05.
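
The settings reported in the Experiment Setup row can be collected in a small configuration object, sketched here in Python. The class name, field names, and helper method are illustrative assumptions and not the authors' code. The learning rate is read as 5 × 10⁻⁵, and the helper uses the standard LoRA convention of scaling the low-rank update by alpha / rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the hyperparameters quoted above."""

    epochs: int = 5
    learning_rate: float = 5e-5   # assumed reading of the reported rate
    batch_size: int = 32
    beta: float = 0.1
    optimizer: str = "adam"
    lora_alpha: int = 256
    lora_rank: int = 64
    lora_dropout: float = 0.05

    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


setup = TrainingSetup()
print(setup)
print("LoRA scaling factor:", setup.lora_scaling())
```

With α = 256 and rank = 64, the low-rank update is scaled by a factor of 4, which is worth noting when comparing against runs that use other rank and α pairs.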