Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Authors: Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda, Daniel Khashabi, Ben Van Durme

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6 EXPERIMENTS AND EMPIRICAL FINDINGS On CoSAlign-Test (Table 3), applying CoSAlign on LLAMA3.1-8B-INSTRUCT and the SFT variant both significantly improves controllability measured by CoSA-Score over their respective base models. Our proposed CoSAlign method significantly outperforms all baselines, including strong cascade methods that use GPT-4o evaluator to filter out unsafe responses, in terms of overall CoSA-Score.
Researcher Affiliation | Collaboration | Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme. Microsoft Responsible AI Research; Johns Hopkins University. Work done during Jingyu Zhang's internship at Microsoft. Correspondence to Jingyu Zhang {EMAIL} and Ahmed Elgohary {EMAIL}.
Pseudocode | Yes | Algorithm 1 CoSAlign response generation, error-scoring mechanism, and response pairing
Open Source Code | No | Project page: https://aka.ms/controllable-safety-alignment
Open Datasets | Yes | We use the BeaverTails dataset sourced from https://github.com/PKU-Alignment/BeaverTails with Apache-2.0 license, and the WildGuardMix dataset sourced from https://huggingface.co/datasets/allenai/wildguardmix with ODC-By license.
Dataset Splits | Yes | A.8 COSALIGN-TEST CONSTRUCTION We provide the breakdown of test prompt categories as follows, with number of prompts specified in parentheses. Seen configs: Test config: no risk allowed Allowed prompts (100): * No risk (100 prompts) Disallowed prompts (300):
Hardware Specification | Yes | All experiments are conducted with 4 NVIDIA A100 80GB GPUs.
Software Dependencies | No | The paper mentions software components like GPT-4o model, LLAMA3.1-8B-INSTRUCT, lm-evaluation-harness codebase, and Llama-Guard-3-8B, but does not provide specific version numbers for these or for core programming languages/libraries (e.g., Python, PyTorch, CUDA) required to replicate the experiment.
Experiment Setup | Yes | We choose hyperparameters α = 0.1, β = 3, γ = 1 to ensure α < γ < β. During training, we conduct SFT and DPO with the RMSProp optimizer and learning rate of 5e-7, and DPO β = 0.1.
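
The Experiment Setup entry uses the symbol β for two different quantities: one of the three hyperparameters (α, β, γ) and the DPO temperature. The following minimal sketch records the reported values in one place and checks the stated ordering α < γ < β. The class names, field names, and the `dpo_loss` helper are hypothetical and are not taken from the paper or its code. The loss shown is the standard DPO objective from the literature, included only to illustrate where the DPO β = 0.1 would enter; it is not confirmed to match the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringHyperparams:
    # Values reported in the Experiment Setup entry (field names are hypothetical).
    alpha: float = 0.1
    beta: float = 3.0
    gamma: float = 1.0

    def is_ordered(self) -> bool:
        # The entry states the values are chosen to ensure alpha < gamma < beta.
        return self.alpha < self.gamma < self.beta


@dataclass(frozen=True)
class TrainingConfig:
    # Reported training settings; dpo_beta is distinct from ScoringHyperparams.beta.
    optimizer: str = "RMSProp"
    learning_rate: float = 5e-7
    dpo_beta: float = 0.1


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = -beta * margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    scoring = ScoringHyperparams()
    train = TrainingConfig()
    print("ordering holds:", scoring.is_ordered())
    # Illustrative log-probabilities only; these are not values from the paper.
    print("example loss:", dpo_loss(-10.0, -12.0, -11.0, -11.0, train.dpo_beta))
```

Keeping the two β values in separate structures avoids confusing them when the settings are transcribed into a training script.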