Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
Authors: Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On comprehensive benchmarks, we show that SAFETYANALYST (average F1=0.81) outperforms existing moderation systems (average F1<0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Allen Institute for Artificial Intelligence 3University of Washington 4Duke University 5Carnegie Mellon University 6Stanford University. Correspondence to: Jing-Jing Li <EMAIL>. |
| Pseudocode | No | The paper describes the SAFETYANALYST framework and its components, including chain-of-thought reasoning and a feature aggregation model, but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To demonstrate this framework, we train and release an open-source system for LLM prompt safety moderation, including a pair of LMs that specialize in generating harm trees and benefit trees, a transparent aggregation model with fully interpretable weight parameters, and a procedure for aligning the aggregation weights to given prompts labeled as safe or unsafe. In addition, we release a series of other resources that enable researchers and engineers to build on SAFETYANALYST: a large-scale dataset of 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts, and the first taxonomies of harmful and beneficial effects for AI safety. https://jl3676.github.io/SafetyAnalyst/ |
| Open Datasets | Yes | In addition, we release a series of other resources that enable researchers and engineers to build on SAFETYANALYST: a large-scale dataset of 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts, and the first taxonomies of harmful and beneficial effects for AI safety. ... randomly sampled from public prompt datasets including Wild Jailbreak (Jiang et al., 2024), Wild Chat (Zhao et al., 2024), and Aegis Safety Train (Ghosh et al., 2024). |
| Dataset Splits | No | The paper mentions aligning aggregation model parameters to "250 harmful and 250 benign held-out Wild Jailbreak prompts" and evaluating on a "balanced test set of 300 prompts". However, it does not provide explicit training, validation, and test splits (e.g., percentages or exact counts for all splits) for the primary task of fine-tuning the student LMs on the 18.5 million harm-benefit features generated from 19k prompts. |
| Hardware Specification | Yes | Fine-tuning was performed with a context window length of 18,000 tokens on 8 NVIDIA H100 GPUs. ... Experiments using open-weight models were run on one NVIDIA H100 GPU with batched inference using vllm (Kwon et al., 2023). |
| Software Dependencies | No | The paper mentions using 'qlora (Dettmers et al., 2024; Lambert et al., 2024)' for fine-tuning and 'vllm (Kwon et al., 2023)' for batched inference. However, specific version numbers for these software components are not provided. |
| Experiment Setup | No | The paper mentions that "Fine-tuning was performed with a context window length of 18,000 tokens" and parameters were optimized by "minimizing the loss computed as the negative log-sigmoid harmfulness score, −log σ(H), constraining all parameters within [0, 1]". However, specific hyperparameters such as learning rates, batch sizes, number of epochs for the fine-tuning process, or specific optimizer settings are not explicitly detailed in the main text. |
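The aggregation objective quoted above (minimizing a negative log-sigmoid of a harmfulness score H, with all parameters constrained to [0, 1]) can be sketched as follows. This is a minimal illustrative sketch, not the paper's released implementation: the linear harm-minus-benefit form of H, the synthetic features and labels, and all variable names are assumptions made here for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic stand-ins for per-prompt harm/benefit feature intensities
# (the paper derives these from generated harm trees and benefit trees).
rng = np.random.default_rng(0)
n, d = 200, 5
harm = rng.uniform(0, 1, (n, d))
benefit = rng.uniform(0, 1, (n, d))
# Hypothetical labels: 1 = unsafe, 0 = safe.
labels = (harm.sum(axis=1) > benefit.sum(axis=1)).astype(float)

# Interpretable aggregation weights, constrained to the [0, 1] box.
w_harm = np.full(d, 0.5)
w_benefit = np.full(d, 0.5)

lr = 0.1
for _ in range(500):
    # Harmfulness score H: weighted harms minus weighted benefits (assumed form).
    H = harm @ w_harm - benefit @ w_benefit
    p = sigmoid(H)
    # Binary cross-entropy; on unsafe prompts this term is exactly -log σ(H).
    grad = p - labels  # d(loss)/dH for each example
    w_harm -= lr * (harm.T @ grad) / n
    w_benefit -= lr * (-(benefit.T @ grad)) / n
    # Project the weights back onto [0, 1] after each gradient step.
    w_harm = np.clip(w_harm, 0.0, 1.0)
    w_benefit = np.clip(w_benefit, 0.0, 1.0)

acc = ((sigmoid(harm @ w_harm - benefit @ w_benefit) > 0.5) == labels).mean()
```

Projected gradient descent (the `np.clip` after each step) is one simple way to honor the [0, 1] constraint the paper states; the actual optimizer, loss weighting for safe prompts, and feature parameterization are not specified in the main text, which is why this variable is marked "No".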