Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Lifelong Safety Alignment for Language Models
Authors: Haoyu Wang, Yifei Zhao, Zeyu Qin, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 Experiments We first describe our experiments settings as follows: ...Table 3: Seen Attacks Evaluation. The ASR is measured in percentage (%). Table 4: Unseen Attacks Evaluation. The ASR is measured in percentage (%). Table 5: Generalization Attacks Evaluation. The ASR is measured in percentage (%). Table 6: Helpfulness Evaluation. The accuracy is measured in percentage (%). Table 7: The ASR of training with all successful strategies one each goal. Table 8: The ASR of training with one successful strategy on each goal. |
| Researcher Affiliation | Collaboration | 1Sea AI Lab, Singapore 2Tsinghua University 3The Hong Kong University of Science and Technology EMAIL; EMAIL |
| Pseudocode | Yes | Algorithm 1 Lifelong Safety Alignment Input: Iteration Times T, Goal Pool G, Meta-Attacker A0, Defender M0, Safeguard Mj, Refusal Generator Mr, Maximum Interaction Times N, Threshold of Successful Goals Percentage K, Successful Buffer Bs, Failed Buffer Bf, Adversarial-Play Evolution of Meta-Attacker process F1, Adversarial-Play Evolution of Defender process F2. for t = 0 to T - 1 do At+1 = F1(g, At, Mt, Mj, Bf, Bs, K, N) Mt+1 = F2(M0, Mr, Bs, D) end for |
| Open Source Code | Yes | The code is available at https://github.com/sail-sg/LifelongSafetyAlignment. |
| Open Datasets | Yes | Datasets. We include 4k illegal instructions from PKU-SafeRLHF [25] as Goal Pool G. We adopt 20k Ultrachat [13] as helpfulness maintaining dataset; we adopt successful jailbreak questions in Bs and corresponding refusal answers as safety training dataset. We include XSTest [49] in the Defender training dataset to avoid over-refusal problem. |
| Dataset Splits | Yes | From this dataset, we extract 4K illegal instructions as the goals in this work and randomly select another 100 goals as test set. To ensure the extracted questions are genuinely harmful, we conduct both human evaluations and evaluations using LLaMA-Guard-3-8B. Ultrachat is a large-scale, fine-grained, and diverse dataset comprising Questions about the World, Writing and Creation, Assistance on Existent Materials. From this dataset, we randomly extract 20K Question & Answer pairs for helpfulness finetuning. XSTest is a dataset comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with and 200 unsafe prompts as contrasts that, for most LLM applications, should be refused. |
| Hardware Specification | Yes | We use 8 A100 40G to conduct experiments. |
| Software Dependencies | Yes | We use vLLM [33] Version 0.6.3 to inference our models. |
| Experiment Setup | Yes | The training configuration includes a cutoff length of 4096, a batch size of 64, 3 training epochs, a cosine learning rate scheduler, and a warmup ratio of 0.1. For SFT with LoRA, we set learning rate to 1e-4. For full finetuning, we set learning rate to 1e-5. For the inference of the Defender models, we set the temperature to 0.95 and the cut off length to 4096. For Best of N sampling on DeepSeek-R1-Distill-Qwen, we set the temperature = 0.7, as they recommend. |
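
The outer loop of Algorithm 1, as quoted in the Pseudocode row, can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `lifelong_alignment`, `Buffers`, `evolve_attacker`, and `evolve_defender` are hypothetical, and the sketch assumes that the Safeguard, the Refusal Generator, the threshold K, the interaction limit N, and the training data are already bound inside the two callables standing in for F1 and F2.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Buffers:
    """Successful (Bs) and failed (Bf) strategy buffers named in Algorithm 1."""
    successful: List[Any] = field(default_factory=list)
    failed: List[Any] = field(default_factory=list)


def lifelong_alignment(
    iterations: int,                      # T
    goal_pool: List[str],                 # G
    attacker: Any,                        # A_0
    defender: Any,                        # M_0
    evolve_attacker: Callable[..., Any],  # stands in for F1 (assumed signature)
    evolve_defender: Callable[..., Any],  # stands in for F2 (assumed signature)
    buffers: Buffers,
) -> Tuple[Any, Any]:
    """Alternate attacker and defender evolution for T iterations."""
    initial_defender = defender  # F2 is written as starting from M_0 each time
    for _ in range(iterations):  # t = 0 .. T - 1
        # A_{t+1} = F1(G, A_t, M_t, ...): the attacker evolves against the
        # current defender and may update the buffers as a side effect.
        attacker = evolve_attacker(goal_pool, attacker, defender, buffers)
        # M_{t+1} = F2(M_0, ..., Bs, ...): the defender is rebuilt from the
        # initial model using the strategies collected so far.
        defender = evolve_defender(initial_defender, buffers)
    return attacker, defender
```

Passing the two evolution steps as callables keeps the loop independent of how attack generation and fine-tuning are actually carried out, which the quoted pseudocode does not specify.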