WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

NeurIPS 2024

Reproducibility assessment: each item below lists the variable, the result, and the supporting evidence from the paper.
Research Type: Experimental. Evidence: "Through extensive model training and evaluations, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of both vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All the components of WILDJAILBREAK contribute to achieving balanced safety behaviors of models."
Researcher Affiliation: Collaboration. 1 University of Washington; 2 Allen Institute for Artificial Intelligence; 3 Seoul National University; 4 Carnegie Mellon University.
Pseudocode: No. The paper includes structured instruction prompts (Prompt 1, Prompt 2, Prompt 3) but does not label any of them as 'Pseudocode' or 'Algorithm'.
Open Source Code: Yes. Code and models: https://github.com/allenai/wildteaming; data: https://huggingface.co/datasets/allenai/wildjailbreak
Open Datasets: Yes. Evidence: "Therefore, with WILDTEAMING we create WILDJAILBREAK, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs." Data: https://huggingface.co/datasets/allenai/wildjailbreak
Dataset Splits: Yes. Evidence: "We augment Tulu2Mix-no-refusal [37], a general capability instruction-tuning dataset consisting of 300K examples, with 200K examples sampled from WILDJAILBREAK, resulting in 500K examples. From WILDJAILBREAK we sample 50K each of vanilla harmful, adversarial harmful, vanilla benign, and adversarial benign items. For all training experiments, we follow the setup introduced in Tulu2 [37] and fine-tune a Llama2 7B base model on our 500K data mixture for 2 epochs."
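The data mixture described above (50K per WildJailbreak category plus the Tulu2Mix-no-refusal pool) can be sketched as follows. This is a toy-scale illustration, not the authors' pipeline: the pool structure, field names, and `build_mixture` helper are assumptions.

```python
import random

# Illustrative sketch of the paper's 500K-example mixture: 50K items
# sampled from each of four WildJailbreak categories, combined with
# 300K Tulu2Mix-no-refusal examples. Names here are placeholders.
CATEGORIES = ["vanilla_harmful", "adversarial_harmful",
              "vanilla_benign", "adversarial_benign"]

def build_mixture(wildjailbreak, tulu2mix, per_category, seed=42):
    """Sample `per_category` items from each WildJailbreak category,
    append the full Tulu2Mix pool, and shuffle the result."""
    rng = random.Random(seed)
    mixture = list(tulu2mix)
    for cat in CATEGORIES:
        mixture.extend(rng.sample(wildjailbreak[cat], per_category))
    rng.shuffle(mixture)
    return mixture

# Toy-scale demo; the paper uses per_category=50_000 and 300K Tulu2Mix.
wjb = {cat: [f"{cat}_{i}" for i in range(100)] for cat in CATEGORIES}
tulu = [f"tulu_{i}" for i in range(300)]
mix = build_mixture(wjb, tulu, per_category=50)
print(len(mix))  # 300 + 4 * 50 = 500
```

At paper scale the same ratios yield the reported 500K examples (300K general-capability + 4 x 50K safety).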
Hardware Specification: Yes. Evidence: "We quantitatively compare the runtime and computational resources required for WILDTEAMING and other baselines, using NVIDIA RTX A6000 GPUs and Tulu2 DPO 7B as the target model."
Software Dependencies: No. The paper states "Our training code was adopted from the EasyLM codebase [25]. Table 26 shows the training hyperparameters." but does not list specific versions for libraries such as PyTorch, or any dependencies beyond the codebase name.
Experiment Setup: Yes. Evidence: "For all training experiments, we follow the setup introduced in Tulu2 [37] and fine-tune a Llama2 7B base model on our 500K data mixture for 2 epochs." Table 26 hyperparameters (instruction-tuning/supervised fine-tuning, consistent with the setup in [36] except for a shorter max sequence length and smaller batch size due to compute constraints): Precision: BFloat16; Epochs: 2; Weight decay: 0; Warmup ratio: 0.03; Learning rate: 2e-5; Max. seq. length: 2048; Batch size: 32.
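For reference, the Table 26 hyperparameters can be collected into a plain config object. This is a sketch only: the actual training code is the authors' EasyLM-based setup, and the `base_model` identifier is an assumed Hugging Face model id, not taken from the paper.

```python
from dataclasses import dataclass, asdict

# Table 26 values transcribed into a config dataclass for reference.
# This is NOT the authors' training code (which was adapted from the
# EasyLM codebase); the base_model id below is an assumption.
@dataclass
class SFTConfig:
    base_model: str = "meta-llama/Llama-2-7b-hf"  # assumed model id
    precision: str = "bfloat16"
    epochs: int = 2
    weight_decay: float = 0.0
    warmup_ratio: float = 0.03
    learning_rate: float = 2e-5
    max_seq_length: int = 2048
    batch_size: int = 32

cfg = SFTConfig()
print(asdict(cfg))
```

A dataclass like this can be passed to whichever trainer is in use; the values themselves match Table 26 as quoted above.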