Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Safety Pretraining: Toward the Next Generation of Safe AI

Authors: Pratyush Maini, Sachin Goyal, Dylan Sam, Alexander Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary C. Lipton, Zico Kolter

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks. Empirically, our Safety Pretraining significantly reduces the rate of harmful generations, achieving a reduction in attack success rate (ASR) from 38.8% to 8.3% on safety benchmarks.
Researcher Affiliation Collaboration Pratyush Maini (1, 2), Sachin Goyal (1), Dylan Sam (1), Alex Robey (1, 4), Yash Savani (1), Yiding Jiang (1), Andy Zou (1, 3, 4), Matt Fredrikson (1, 4), Zachary C. Lipton (1), J. Zico Kolter (1). Affiliations: (1) Carnegie Mellon University, (2) Datology AI, (3) Center for AI Safety, (4) Gray Swan AI.
Pseudocode Yes
Algorithm 1: Harmfulness-Tag Annotated Pretraining
Input: Unsafe text segment D = {w_1, w_2, ..., w_n}
Output: Modified text D' with inline Harmfulness-Tag tokens
D' ← w_1 ; // Initialize with first word
for i = 2 to n do
    if random() < p then
        D' ← D' + Harmfulness-Tag + w_i ; // Insert tag before word with probability p
    else
        D' ← D' + w_i ; // Append next word normally
return D' ; // Return tag-injected sequence
Algorithm 2: Safe Beam Search with Harmfulness Filtering
Input: Prompt P, beam size k, harmful token τ = <potentially unsafe content>, model f
Output: Decoded sequence that avoids unsafe continuations
Initialize beam set B_0 = {(P, log p = 0)} ; // Each beam: (text, cumulative log-prob)
for each decoding step t do
    foreach beam (y, log p_y) ∈ B_{t-1} do
        Compute top-N candidates t_1, ..., t_N ∼ f(· | y) ;
        foreach token t_i do
            y' = y ⊕ t_i ; // Extend sequence
            log p_{y'} = log p_y + log f(t_i | y) ; // Updated log-prob
            p_τ(y') = f(τ | y') ; // Lookahead for harmful tag
    Form candidate set C_t = {(y', log p_{y'}, p_τ(y'))} ;
    Discard 50% of candidates with highest p_τ(y') ; // Filter out risky beams
    Select top-k candidates by log-prob to form B_t ;
return ŷ = arg max_{(y, log p_y) ∈ B_T} log p_y
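The two algorithms quoted in the Pseudocode row can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the word-level tokenization, the `next_log_probs` callback standing in for model f, the rounding rule for the 50% cut, and the toy model are all assumptions made for the example.

```python
import math
import random

HARM_TAG = "<potentially unsafe content>"  # tag string given for τ in Algorithm 2


def inject_tags(words, p, rng=None):
    """Algorithm 1 sketch: insert HARM_TAG before each word after the first with probability p."""
    rng = rng or random.Random()
    if not words:
        return []
    out = [words[0]]  # initialize with the first word
    for word in words[1:]:
        if rng.random() < p:
            out.append(HARM_TAG)  # tag goes before the word
        out.append(word)
    return out


def safe_beam_search(prompt, next_log_probs, beam_size, top_n, steps, tag=HARM_TAG):
    """Algorithm 2 sketch. `next_log_probs(seq)` is a hypothetical interface that
    returns a dict mapping each next token to its log-probability given `seq`."""
    beams = [(tuple(prompt), 0.0)]  # each beam: (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            dist = next_log_probs(seq)
            for tok in sorted(dist, key=dist.get, reverse=True)[:top_n]:
                new_seq = seq + (tok,)
                new_logp = logp + dist[tok]
                # Lookahead: probability that the tag comes next after the extension.
                p_tag = math.exp(next_log_probs(new_seq).get(tag, float("-inf")))
                candidates.append((new_seq, new_logp, p_tag))
        # Discard the half with the highest tag probability.
        # Assumption: for odd counts, the larger half is kept.
        candidates.sort(key=lambda c: c[2])
        kept = candidates[: max(1, (len(candidates) + 1) // 2)]
        # Keep the top-k survivors by cumulative log-prob.
        kept.sort(key=lambda c: c[1], reverse=True)
        beams = [(s, lp) for s, lp, _ in kept[:beam_size]]
    return max(beams, key=lambda b: b[1])[0]


def toy_model(seq):
    """Hypothetical three-token model: the tag becomes likely right after 'risky'."""
    if seq and seq[-1] == "risky":
        probs = {"safe": 0.1, "risky": 0.3, HARM_TAG: 0.6}
    else:
        probs = {"safe": 0.4, "risky": 0.55, HARM_TAG: 0.05}
    return {tok: math.log(pr) for tok, pr in probs.items()}


if __name__ == "__main__":
    print(inject_tags("a b c d".split(), p=0.05, rng=random.Random(0)))
    print(safe_beam_search(["start"], toy_model, beam_size=2, top_n=2, steps=3))
```

In the toy run, plain beam search would prefer "risky" at every step because it has the higher next-token probability; the lookahead filter drops those extensions because the tag is likely to follow them.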
Open Source Code Yes Our work culminates in the open-source release of a family of natively safe 1.7B parameter language models. We also release our classifier at https://huggingface.co/locuslab/safety-classifier_gte-large-en-v1.5.
Open Datasets Yes Notably, we release SafeWeb, a safety-focused 100 billion token synthetic data corpus... The dataset is publicly accessible on Hugging Face at https://huggingface.co/datasets/locuslab/safeweb. The RefuseWeb dataset is publicly accessible on Hugging Face at https://huggingface.co/datasets/locuslab/refuseweb. The Moral Education dataset is publicly accessible on Hugging Face at https://huggingface.co/datasets/locuslab/moral_education. We release the datasets for base model safety evaluation at https://huggingface.co/datasets/locuslab/jb-completions.
Dataset Splits No The paper does not provide explicit training/test/validation splits for its main model pretraining or instruction tuning. It mentions mixing datasets and injecting fractions of data (e.g., "inject a small fraction (10%) of harmfulness-tag annotated completions"), and uses standard benchmarks for evaluation, but does not specify how its *own* experimental data (or combined data) was split for training, validation, and testing of its primary models. For the safety classifier, it vaguely states "We hold out a portion of these annotations as an evaluation set" without specific numbers or percentages.
Hardware Specification Yes To perform pretraining for each of the 1.7B parameter models on 600B tokens, we used 4 nodes of 8x H100 GPUs for roughly 6-7 days.
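The hardware figures above imply a rough compute budget, which the short calculation below works out. It is a back-of-the-envelope sketch only: it assumes all 32 GPUs were busy for the full wall-clock time, and the derived throughput is an implied average, not a number reported in the paper.

```python
NODES = 4               # from the quoted text
GPUS_PER_NODE = 8       # 8x H100 per node
TOKENS = 600e9          # 600B pretraining tokens
SECONDS_PER_DAY = 86_400


def gpu_hours(days):
    """Total GPU-hours for a run lasting `days` of wall-clock time."""
    return NODES * GPUS_PER_NODE * 24 * days


def tokens_per_gpu_second(days):
    """Implied average throughput per GPU, assuming full utilization."""
    return TOKENS / (NODES * GPUS_PER_NODE * days * SECONDS_PER_DAY)


if __name__ == "__main__":
    for days in (6, 7):  # the quoted range is "roughly 6-7 days"
        print(days, gpu_hours(days), round(tokens_per_gpu_second(days)))
```

Under these assumptions, one 1.7B model costs between 4608 and 5376 H100 GPU-hours.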
Software Dependencies Yes All training is performed using the LitGPT framework (AI, 2023), with FlashAttention-2 enabled and mixed-precision training for efficiency. We also release our classifier at https://huggingface.co/locuslab/safety-classifier_gte-large-en-v1.5. We adopt a classifier finetuned from the gte-base-en-v1.5 embedding model (Zhang et al., 2024).
Experiment Setup Yes We adopt the same optimization hyperparameters (e.g., learning rate schedule, batch size, and sequence length) as used in the original SmolLM2 pretraining setup to ensure comparability across scaling studies. For our smaller embedding-based models (e.g., ), we perform standard finetuning of all parameters with a learning rate of 1e-5, a batch size of 8, and weight decay of 0.001 for 50 epochs. For our larger embedding-based models (e.g., gte-large-en-v1.5, multilingual-e5-large-instruct, Arctic-embed-l-v2.0), we first train the linear head only for a single epoch with a batch size of 32 and a learning rate of 1e-3. We then perform full finetuning for all models with a learning rate of 1e-6, a batch size of 8, and weight decay of 0.001 for 5 epochs. We use λ = 100 for the larger embedding models. Among the tested values, inserting harmfulness-tag in 5% of the sequence positions achieves the lowest Attack Success Rate (ASR), striking a balance between signal sparsity and training signal strength.
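The classifier finetuning schedule quoted above can be recorded as a small configuration sketch. The class and constant names are hypothetical; every numeric value is copied from the quoted text, and fields the text does not state are left as None. Reading the 5-epoch full finetuning as the second stage for the larger models is an interpretation of the quoted wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    trainable: str                        # "all" or "linear_head"
    learning_rate: float
    batch_size: int
    epochs: int
    weight_decay: Optional[float] = None  # None: not stated in the quoted text


# Smaller embedding-based models: a single full-finetuning stage.
SMALL_SCHEDULE = [Stage("all", 1e-5, 8, 50, weight_decay=0.001)]

# Larger embedding-based models: linear head first, then full finetuning.
LARGE_SCHEDULE = [
    Stage("linear_head", 1e-3, 32, 1),
    Stage("all", 1e-6, 8, 5, weight_decay=0.001),
]

LAMBDA_LARGE = 100  # λ for the larger models; its role is not defined in the quoted text
TAG_RATE = 0.05     # fraction of sequence positions given a harmfulness tag


def total_epochs(schedule):
    """Sum of epochs across all stages of a schedule."""
    return sum(stage.epochs for stage in schedule)


if __name__ == "__main__":
    print(total_epochs(SMALL_SCHEDULE), total_epochs(LARGE_SCHEDULE))
```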