WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

NeurIPS 2024

Reproducibility assessment: each item below lists the variable, the result, and the supporting evidence from the paper.
Research Type: Experimental. Evidence: "Through extensive model training and evaluations, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of both vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All the components of WILDJAILBREAK contribute to achieving balanced safety behaviors of models."
Researcher Affiliation: Collaboration. 1 University of Washington; 2 Allen Institute for Artificial Intelligence; 3 Seoul National University; 4 Carnegie Mellon University.
Pseudocode: No. The paper includes structured instruction prompts (Prompt 1, Prompt 2, Prompt 3) but does not label any of them as 'Pseudocode' or 'Algorithm'.
Open Source Code: Yes. Code and models: https://github.com/allenai/wildteaming; data: https://huggingface.co/datasets/allenai/wildjailbreak
Open Datasets: Yes. Evidence: "Therefore, with WILDTEAMING we create WILDJAILBREAK, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs." Data: https://huggingface.co/datasets/allenai/wildjailbreak
Dataset Splits: Yes. Evidence: "We augment Tulu2Mix-no-refusal [37], a general capability instruction-tuning dataset consisting of 300K examples, with 200K examples sampled from WILDJAILBREAK, resulting in 500K examples. From WILDJAILBREAK we sample 50K each of vanilla harmful, adversarial harmful, vanilla benign, and adversarial benign items. For all training experiments, we follow the setup introduced in Tulu2 [37] and fine-tune a Llama2 7B base model on our 500K data mixture for 2 epochs."
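The data mixture described above (50K per WildJailbreak category plus the Tulu2Mix-no-refusal pool) can be sketched as follows. This is a toy-scale illustration, not the authors' pipeline: the pool structure, field names, and `build_mixture` helper are assumptions.

```python
import random

# Illustrative sketch of the paper's 500K-example mixture: 50K items
# sampled from each of four WildJailbreak categories, combined with
# 300K Tulu2Mix-no-refusal examples. Names here are placeholders.
CATEGORIES = ["vanilla_harmful", "adversarial_harmful",
              "vanilla_benign", "adversarial_benign"]

def build_mixture(wildjailbreak, tulu2mix, per_category, seed=42):
    """Sample `per_category` items from each WildJailbreak category,
    append the full Tulu2Mix pool, and shuffle the result."""
    rng = random.Random(seed)
    mixture = list(tulu2mix)
    for cat in CATEGORIES:
        mixture.extend(rng.sample(wildjailbreak[cat], per_category))
    rng.shuffle(mixture)
    return mixture

# Toy-scale demo; the paper uses per_category=50_000 and 300K Tulu2Mix.
wjb = {cat: [f"{cat}_{i}" for i in range(100)] for cat in CATEGORIES}
tulu = [f"tulu_{i}" for i in range(300)]
mix = build_mixture(wjb, tulu, per_category=50)
print(len(mix))  # 300 + 4 * 50 = 500
```

At paper scale the same ratios yield the reported 500K examples (300K general-capability + 4 x 50K safety).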
Hardware Specification: Yes. Evidence: "We quantitatively compare the runtime and computational resources required for WILDTEAMING and other baselines, using NVIDIA RTX A6000 GPUs and Tulu2 DPO 7B as the target model."
Software Dependencies: No. The paper states "Our training code was adopted from the EasyLM codebase [25]. Table 26 shows the training hyperparameters." but does not list specific versions for libraries such as PyTorch, or any dependencies beyond the codebase name.
Experiment Setup: Yes. Evidence: "For all training experiments, we follow the setup introduced in Tulu2 [37] and fine-tune a Llama2 7B base model on our 500K data mixture for 2 epochs." Table 26 hyperparameters (instruction-tuning/supervised fine-tuning, consistent with the setup in [36] except for a shorter max sequence length and smaller batch size due to compute constraints): Precision: BFloat16; Epochs: 2; Weight decay: 0; Warmup ratio: 0.03; Learning rate: 2e-5; Max. seq. length: 2048; Batch size: 32.
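For reference, the Table 26 hyperparameters can be collected into a plain config object. This is a sketch only: the actual training code is the authors' EasyLM-based setup, and the `base_model` identifier is an assumed Hugging Face model id, not taken from the paper.

```python
from dataclasses import dataclass, asdict

# Table 26 values transcribed into a config dataclass for reference.
# This is NOT the authors' training code (which was adapted from the
# EasyLM codebase); the base_model id below is an assumption.
@dataclass
class SFTConfig:
    base_model: str = "meta-llama/Llama-2-7b-hf"  # assumed model id
    precision: str = "bfloat16"
    epochs: int = 2
    weight_decay: float = 0.0
    warmup_ratio: float = 0.03
    learning_rate: float = 2e-5
    max_seq_length: int = 2048
    batch_size: int = 32

cfg = SFTConfig()
print(asdict(cfg))
```

A dataclass like this can be passed to whichever trainer is in use; the values themselves match Table 26 as quoted above.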