Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Angular Steering: Behavior Control via Rotation in Activation Space

Authors: Minh Hieu Vu, Tan M. Nguyen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches.
Researcher Affiliation | Collaboration | Hieu M. Vu, Independent; Tan M. Nguyen, Department of Mathematics, National University of Singapore
Pseudocode | Yes | We summarize the algorithms for feature direction extraction, steering plane selection, and angular steering in Appendix B.
Open Source Code | Yes | Code and artifacts are available at https://github.com/lone17/angular-steering/.
Open Datasets | Yes | To calibrate the feature (refusal) direction, we construct two datasets: D_harmful^(cal), which is a split (80%) of the ADVBENCH dataset [59] consisting of 416 harmful instructions; and D_harmless^(cal), a random subset of 512 harmless examples from the ALPACA dataset [48]. For evaluating steering effectiveness, we use the remaining 20% of ADVBENCH, denoted as D_harmful^(eval), containing 104 samples. To assess general language modeling capabilities, we employ the TINYBENCHMARKS dataset [24], a collection of reduced-scale benchmarks each containing 100 examples: ARC [8], MMLU [15], WINOGRANDE [40], GSM8K [9], TRUTHFULQA [22], and HELLASWAG [56].
Dataset Splits | Yes | To calibrate the feature (refusal) direction, we construct two datasets: D_harmful^(cal), which is a split (80%) of the ADVBENCH dataset [59] consisting of 416 harmful instructions; and D_harmless^(cal), a random subset of 512 harmless examples from the ALPACA dataset [48]. For evaluating steering effectiveness, we use the remaining 20% of ADVBENCH, denoted as D_harmful^(eval), containing 104 samples.
Hardware Specification | Yes | This research was conducted using mainly Nvidia H100 GPUs with 80GB of memory.
Software Dependencies | No | For each model: Constructing the steering plane took about 15 minutes on 1 GPU using TRANSFORMERLENS [30]. Pre-generating responses for evaluation took about 10 minutes on 1 GPU using our fork of vLLM [18] as the serving engine. Evaluation with substring matching [1], LLAMA 3 GUARD [23] and HARMBENCH [27] collectively took about 10 minutes on 1 GPU using vLLM [18] as the serving engine. Evaluation with LLM-as-a-judge took about 50 minutes on 4 GPUs using vLLM [18] as the serving engine. Computing perplexity scores took about 5 minutes on 1 GPU. Evaluation with TINYBENCHMARKS [24] took about 4 hours on 1 GPU using vLLM [18] as the serving engine and LM HARNESS [13] as the evaluation device. Our fork of the vLLM project with Angular Steering integrated can be found at https://github.com/lone17/vllm/tree/feat/steering.
Experiment Setup | Yes | For inference, we apply Adaptive Angular Steering as described in Eqn. 3 on every normalization module before each Attention and MLP layer. By varying the target angular position θ from 0 to 360 degrees (with 10-degree intervals), we observe that the models change from refusal to compliance and back to refusal again (see Fig. 7).
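The angle sweep described in the Experiment Setup row can be illustrated with a small sketch. It is not the paper's Eqn. 3, which is not reproduced on this page, and it omits the adaptive variant entirely. It assumes that steering means moving the component of an activation that lies in a 2D steering plane to a target angle θ while keeping that component's norm and leaving the rest of the activation untouched. The function name `steer_to_angle`, the basis vectors `b1` and `b2`, and the toy activation `h` are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer_to_angle(h, b1, b2, theta_deg):
    """Move the in-plane part of h to angle theta_deg, measured from b1.

    Assumption: b1 and b2 are orthonormal and span the steering plane.
    The in-plane norm is preserved; the component of h orthogonal to
    the plane is left unchanged.
    """
    c1, c2 = dot(h, b1), dot(h, b2)      # current in-plane coordinates
    r = math.hypot(c1, c2)               # norm of the in-plane component
    theta = math.radians(theta_deg)
    t1, t2 = r * math.cos(theta), r * math.sin(theta)  # target coordinates
    # Swap the old in-plane component for the new one.
    return [x + (t1 - c1) * p + (t2 - c2) * q for x, p, q in zip(h, b1, b2)]


# Toy example (hypothetical values): a 3-d activation and an axis-aligned plane.
b1 = [1.0, 0.0, 0.0]
b2 = [0.0, 1.0, 0.0]
h = [3.0, 4.0, 5.0]

# 10-degree steps; 360 degrees coincides with 0, so it is left out here.
angles = list(range(0, 360, 10))
steered = {a: steer_to_angle(h, b1, b2, a) for a in angles}
```

In this toy setup the in-plane component of `h` has norm 5, so every steered vector keeps that in-plane norm and the same third coordinate; only the direction inside the plane changes with θ.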