Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

LLMs Encode Harmfulness and Refusal Separately

Authors: Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful... For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B... Our code is released at https://github.com/CHATS-lab/Llms_Encode_Harmfulness_Refusal_Separately. 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Sections 3 through 7 consist entirely of experiments, results, and applications.
Researcher Affiliation | Academia | Jiachen Zhao (Northeastern University); Jing Huang (Stanford University); Zhengxuan Wu (Stanford University); David Bau (Northeastern University); Weiyan Shi (Northeastern University)
Pseudocode | No | The paper describes methods and procedures in narrative text and uses mathematical equations (e.g., Equations 1, 2, 3, 5) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is released at https://github.com/CHATS-lab/Llms_Encode_Harmfulness_Refusal_Separately.
Open Datasets | Yes | For harmful instructions, we use Advbench [Zou et al., 2023b], JBB [Chao et al., 2024], and Sorry-Bench [Xie et al., 2025]... we follow previous work [Arditi et al., 2024] to use ALPACA, an instruction finetuning dataset [Taori et al., 2023]... We use examples from Xstest [Röttger et al., 2023]... We employ the ToxicChat [Lin et al., 2023] and the OpenAI Moderation Evaluation Dataset [Markov et al., 2023].
Dataset Splits | Yes | We sample 100 harmful and 100 harmless examples from the training set (see details in Appendix D) to compute the centroid of clusters. ... the Sorry-Bench dataset is held out and used as accepted harmful instructions for evaluation. We sample 100 harmful instructions refused at the $t_{\text{post-inst}}$ position from Advbench and JBB to compute the center of the harmfulness cluster... we run through Xstest [Röttger et al., 2023] for each model to find refused harmless instructions, which are then held out for testing. ... We also randomly sample 100 harmless instructions accepted at $t_{\text{post-inst}}$ to compute the center of the harmlessness cluster.
Hardware Specification | Yes | Experiments on these models are run on A100-40GB GPUs.
Software Dependencies | No | The paper mentions using specific LLM models (LLAMA2-CHAT-7B, LLAMA3-INSTRUCT-8B, QWEN2-INSTRUCT-7B) and their underlying Transformer architecture, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA used for the experiments.
Experiment Setup | Yes | In this section, we describe the setup in our following experiments. Models. We focus on widely-used instruct models... We use three widely-adopted open-source models: LLAMA2-CHAT-7B [Touvron et al., 2023], LLAMA3-INSTRUCT-8B [Meta AI, 2024] and QWEN2-INSTRUCT-7B [Yang et al., 2024]... Prompting templates. These instruct models all have their own chat templates... Hidden states extraction. Decoder-only Transformers [Vaswani et al., 2017] are the backbone of mainstream LLMs... We consider two token positions: (1) Instruction $t_{\text{inst}}$... (2) Post-instruction $t_{\text{post-inst}}$... Datasets. We employ a wide range of public datasets... Jailbreak methods. We employ three different types of jailbreak methods... Refusal rate. Instruct models are usually finetuned to return certain fixed phrases... (Section 3.4) We then intervene on the residual stream for the hidden state of test examples using activation addition at layer $l$, i.e., $h'^{\,l} = h^{l} + v^{l}_{\text{harmful}}$ to all tokens of input instructions. As comparison, we also extract a refusal direction as $v^{l}_{\text{refuse}} = \mu^{l,\,t_{\text{post-inst}}}_{\text{refuse}} - \mu^{l,\,t_{\text{post-inst}}}_{\text{accept}}$ at token $t_{\text{post-inst}}$.
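
The centroid and steering steps quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal NumPy sketch. This is not the authors' released code. It uses tiny made-up arrays in place of real hidden states, and it assumes the harmfulness direction is the difference between the harmful and harmless centroids, by analogy with the refusal direction quoted above (the excerpt does not define it). The function names, the `scale` parameter, and the Euclidean distance used to pick the nearer centroid are all hypothetical choices made for illustration.

```python
import numpy as np


def mean_difference_direction(states_a, states_b):
    """Return mean(states_a) - mean(states_b); inputs are (n_examples, d_model)."""
    return np.mean(states_a, axis=0) - np.mean(states_b, axis=0)


def add_direction(hidden, direction, scale=1.0):
    """Activation addition: add one vector to every token's hidden state.

    hidden is (n_tokens, d_model); direction is (d_model,).
    scale is a hypothetical knob, not taken from the source.
    """
    return hidden + scale * direction


def closer_to_first(state, centroid_a, centroid_b):
    """True if state is nearer centroid_a than centroid_b (Euclidean; an assumption)."""
    dist_a = np.linalg.norm(state - centroid_a)
    dist_b = np.linalg.norm(state - centroid_b)
    return bool(dist_a < dist_b)


# Toy stand-ins for hidden states collected at one layer and one token position.
harmful_states = np.array([[1.0, 2.0], [3.0, 4.0]])
harmless_states = np.array([[0.0, 0.0], [2.0, 2.0]])

# Cluster centroids and the (assumed) difference-of-means direction.
mu_harmful = harmful_states.mean(axis=0)
mu_harmless = harmless_states.mean(axis=0)
v_harmful = mean_difference_direction(harmful_states, harmless_states)

# A 3-token test instruction whose hidden states all sit at the harmless centroid.
test_hidden = np.tile(mu_harmless, (3, 1))
steered = add_direction(test_hidden, v_harmful)

# Compare the last token's state to both centroids before and after steering.
print(closer_to_first(test_hidden[-1], mu_harmful, mu_harmless))
print(closer_to_first(steered[-1], mu_harmful, mu_harmless))
```

In this toy setup, adding the direction moves every token's state from the harmless centroid onto the harmful one, so the nearest-centroid check flips. With a real model, the hidden states would come from a chosen layer and token position, and the intervention would be applied inside the forward pass rather than to a stored array.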