Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Authors: Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda, Daniel Khashabi, Ben Van Durme
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 EXPERIMENTS AND EMPIRICAL FINDINGS: On CoSAlign-Test (Table 3), applying CoSAlign on LLAMA3.1-8B-INSTRUCT and the SFT variant both significantly improves controllability measured by CoSA-Score over their respective base models. Our proposed CoSAlign method significantly outperforms all baselines, including strong cascade methods that use GPT-4o evaluator to filter out unsafe responses, in terms of overall CoSA-Score. |
| Researcher Affiliation | Collaboration | Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme; Microsoft Responsible AI Research, Johns Hopkins University. Work done during Jingyu Zhang's internship at Microsoft. Correspondence to Jingyu Zhang {EMAIL} and Ahmed Elgohary {EMAIL}. |
| Pseudocode | Yes | Algorithm 1: CoSAlign response generation, error-scoring mechanism, and response pairing |
| Open Source Code | No | Project page: https://aka.ms/controllable-safety-alignment |
| Open Datasets | Yes | We use the BeaverTails dataset sourced from https://github.com/PKU-Alignment/BeaverTails with Apache-2.0 license, and the WildGuardMix dataset sourced from https://huggingface.co/datasets/allenai/wildguardmix with ODC-By license. |
| Dataset Splits | Yes | A.8 COSALIGN-TEST CONSTRUCTION: We provide the breakdown of test prompt categories as follows, with number of prompts specified in parentheses. Seen configs: Test config: no risk allowed. Allowed prompts (100): No risk (100 prompts). Disallowed prompts (300): |
| Hardware Specification | Yes | All experiments are conducted with 4 NVIDIA A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software components like GPT-4o model, LLAMA3.1-8B-INSTRUCT, lm-evaluation-harness codebase, and Llama-Guard-3-8B, but does not provide specific version numbers for these or for core programming languages/libraries (e.g., Python, PyTorch, CUDA) required to replicate the experiment. |
| Experiment Setup | Yes | We choose hyperparameters α = 0.1, β = 3, γ = 1 to ensure α < γ < β. During training, we conduct SFT and DPO with the RMSProp optimizer and learning rate of 5e-7, and DPO β = 0.1. |
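
For anyone attempting a replication, the values quoted in the Hardware Specification and Experiment Setup rows can be collected in one place. The sketch below is only an illustration: the class name `ReportedSetup` and all field names are hypothetical and do not come from the paper or its code. It records the quoted numbers and checks the stated ordering α < γ < β. Note that the DPO β (0.1) is a separate quantity from the hyperparameter β (3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""

    alpha: float = 0.1            # hyperparameter α as quoted
    beta: float = 3.0             # hyperparameter β as quoted
    gamma: float = 1.0            # hyperparameter γ as quoted
    learning_rate: float = 5e-7   # quoted for both SFT and DPO
    dpo_beta: float = 0.1         # DPO β, distinct from `beta` above
    optimizer: str = "RMSProp"    # optimizer named in the Experiment Setup row
    num_gpus: int = 4             # NVIDIA A100 80GB, per the Hardware row

    def __post_init__(self) -> None:
        # The quoted text states the values are chosen so that α < γ < β.
        if not (self.alpha < self.gamma < self.beta):
            raise ValueError("expected alpha < gamma < beta")


setup = ReportedSetup()
```

Because the Software Dependencies row reports that no library versions are given, a record like this would still need to be paired with pinned versions chosen by the replicator.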