Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Refusal Direction is Universal Across Safety-Aligned Languages

Authors: Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schuetze, Barbara Plank

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate this hypothesis, we perform a series of activation-based interventions across multiple languages. To enable this cross-linguistic analysis, we develop PolyRefuse, a dataset containing translated harmful prompts across 14 linguistically diverse languages. We first extract refusal directions with English prompts and assess their effectiveness in modulating model behavior when applied to others. We then derive refusal directions from three safety-aligned non-English languages spanning diverse language families and scripts, and evaluated their transferability across the language spectrum.
Researcher Affiliation | Academia | (1) LMU Munich; (2) Munich Center for Machine Learning
Pseudocode | No | The paper describes methods using mathematical equations (1) to (5) and textual descriptions, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | We make our code publicly available at https://github.com/mainlp/Multilingual-Refusal.
Open Datasets | Yes | To enable this cross-linguistic analysis, we develop PolyRefuse, a dataset containing translated harmful prompts across 14 linguistically diverse languages. We begin with the English datasets used by Arditi et al. [2024], where D_harmful consists of harmful instructions from ADVBENCH [Zou et al., 2023], MALICIOUSINSTRUCT [Huang et al., 2024], and TDC2023 [Mazeika et al., 2024, 2023], while D_harmless contains samples from ALPACA [Taori et al., 2023].
Dataset Splits | Yes | Following that, we randomly sample 128 queries from both D_harmful and D_harmless categories to create the training sets D^train_harmful and D^train_harmless in each language. Similarly, we create validation sets D^val_harmful with 32 samples per language to select the most effective refusal vectors. ... To evaluate the cross-lingual effectiveness of the extracted refusal vectors, we also construct a test set D^test_harmful containing 572 harmful prompts for each language.
Hardware Specification | No | The paper states in its NeurIPS checklist that it describes the compute used in section A.3, but section A.3 primarily contains experimental results and visualizations and does not specify hardware details like GPU/CPU models, memory, or cloud instance types.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) needed to replicate the experiments.
Experiment Setup | Yes | We adhere to this established methodology for vector ablation and addition operations. ... We measured two key metrics: (1) KL divergence between original and ablated first token probability distributions, which quantifies the distributional shift caused by ablation, and (2) refusal score, which directly measures the model's propensity to refuse harmful requests. ... Due to KL filtering, the selected refusal vector exhibits relatively low KL divergence while achieving substantial reductions in refusal scores. This maximizes the attack's effectiveness (high refusal score reduction) while minimizing unwanted side effects on the model's general behavior (low KL divergence).
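
The KL-filtered selection quoted under Experiment Setup can be illustrated with a small sketch. This is not the authors' code. It assumes that the KL divergence is taken from the original distribution to the ablated one, that candidates are filtered with a hypothetical threshold of 0.1, and that the surviving candidate with the lowest post-ablation refusal score is chosen. The candidate names, probabilities, and refusal scores are made-up toy values.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def select_refusal_vector(candidates, kl_threshold=0.1):
    """Keep candidates whose KL is at most the threshold, then return the one
    with the lowest post-ablation refusal score (None if nothing survives)."""
    kept = [c for c in candidates if c["kl"] <= kl_threshold]
    if not kept:
        return None
    return min(kept, key=lambda c: c["refusal_score"])


# Toy first-token distribution of the unmodified model (assumed values).
original = [0.7, 0.2, 0.1]

# Hypothetical candidates: first-token distribution after ablating each
# candidate vector, plus the refusal score measured after ablation.
raw_candidates = [
    {"name": "layer_10", "ablated": [0.68, 0.22, 0.10], "refusal_score": 0.40},
    {"name": "layer_14", "ablated": [0.65, 0.25, 0.10], "refusal_score": 0.05},
    {"name": "layer_20", "ablated": [0.10, 0.20, 0.70], "refusal_score": 0.01},
]

# Attach the KL divergence of each candidate to its record.
scored = [
    {**c, "kl": kl_divergence(original, c["ablated"])} for c in raw_candidates
]

best = select_refusal_vector(scored, kl_threshold=0.1)
print(best["name"], round(best["kl"], 4), best["refusal_score"])
```

In this toy data, the candidate with the lowest refusal score shifts the first-token distribution heavily and is filtered out, so the selection falls to a candidate that lowers refusal while staying close to the original distribution. That is the trade-off the quoted passage describes.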