WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive model training and evaluations, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of both vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All the components of WILDJAILBREAK contribute to achieving balanced safety behaviors of models. |
| Researcher Affiliation | Collaboration | University of Washington, Allen Institute for Artificial Intelligence, Seoul National University, Carnegie Mellon University |
| Pseudocode | No | The paper includes structured instruction prompts (Prompt 1, Prompt 2, Prompt 3) but does not label them as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code & models: https://github.com/allenai/wildteaming; data: https://huggingface.co/datasets/allenai/wildjailbreak |
| Open Datasets | Yes | Therefore, with WILDTEAMING we create WILDJAILBREAK, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. Data: https://huggingface.co/datasets/allenai/wildjailbreak |
| Dataset Splits | Yes | We augment Tulu2Mix-no-refusal [37], a general capability instruction-tuning dataset consisting of 300K examples, with 200K examples sampled from WILDJAILBREAK, resulting in 500K examples. From WILDJAILBREAK we sample 50K each of vanilla harmful, adversarial harmful, vanilla benign, and adversarial benign items (see the sampling sketch below the table). |
| Hardware Specification | Yes | We quantitatively compare the runtime and computational resources required for WILDTEAMING and other baselines, using NVIDIA RTX A6000 GPUs and Tulu2 DPO 7B as the target model. |
| Software Dependencies | No | Our training code was adopted from the EasyLM codebase [25]. Table 26 shows the training hyperparameters. The paper does not list specific software versions for libraries such as PyTorch or other dependencies beyond naming the codebase. |
| Experiment Setup | Yes | For all training experiments, we follow the setup introduced in Tulu2 [37] and fine-tune a Llama2 7B base model on our 500K data mixture for 2 epochs. Table 26: Hyperparameters used for instruction-tuning/supervised fine-tuning, consistent with the setup in [36] except that we choose a shorter max sequence length and smaller batch size due to compute constraints: Precision: BFloat16; Epochs: 2; Weight decay: 0; Warmup ratio: 0.03; Learning rate: 2e-5; Max. seq. length: 2048; Batch size: 32 (see the configuration sketch below the table). |
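
The Dataset Splits row describes sampling 50K examples from each of the four WILDJAILBREAK categories to build the 200K-example safety portion of the training mixture. The sketch below shows one way to reproduce that sampling with the Hugging Face `datasets` library; the `"train"` config name and the `data_type` column with the four labels used here are assumptions about the released dataset, so check the dataset card at https://huggingface.co/datasets/allenai/wildjailbreak for the actual configuration and field names.

```python
# Minimal sketch of the 200K-example WildJailbreak sample (50K per category).
# Assumptions: the dataset loads with a "train" config and exposes a
# "data_type" column with the four category labels below.
from datasets import load_dataset, concatenate_datasets

CATEGORIES = [
    "vanilla_harmful",      # assumed label
    "adversarial_harmful",  # assumed label
    "vanilla_benign",       # assumed label
    "adversarial_benign",   # assumed label
]
PER_CATEGORY = 50_000  # 4 x 50K = 200K WildJailbreak examples

wjb = load_dataset("allenai/wildjailbreak", "train", split="train")

samples = []
for category in CATEGORIES:
    subset = wjb.filter(lambda ex, c=category: ex["data_type"] == c)
    subset = subset.shuffle(seed=42).select(range(PER_CATEGORY))
    samples.append(subset)

# The paper combines these 200K examples with the 300K-example
# Tulu2Mix-no-refusal set (not shown here) to form the 500K-example mixture.
wildjailbreak_mix = concatenate_datasets(samples)
print(len(wildjailbreak_mix))  # expected: 200000
```
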
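The Experiment Setup row reports the Table 26 hyperparameters. The sketch below maps those values onto Hugging Face `TrainingArguments` purely for illustration: the authors report training with the EasyLM codebase, not the HF Trainer, the output directory name is hypothetical, and the per-device/gradient-accumulation split used to reach the effective batch size of 32 is an assumption.

```python
# Table 26 hyperparameters expressed as Hugging Face TrainingArguments.
# Illustrative mapping only; the paper's training used the EasyLM codebase.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama2-7b-wildjailbreak-sft",  # hypothetical name
    bf16=True,                        # Precision: BFloat16
    num_train_epochs=2,               # Epochs: 2
    weight_decay=0.0,                 # Weight decay: 0
    warmup_ratio=0.03,                # Warmup ratio: 0.03
    learning_rate=2e-5,               # Learning rate: 2e-5
    per_device_train_batch_size=2,    # assumed split of the
    gradient_accumulation_steps=16,   # effective batch size of 32
)
# The max. sequence length of 2048 is applied at tokenization time rather
# than here, e.g. tokenizer(..., truncation=True, max_length=2048).
```
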