Curiosity-driven Red-teaming for Large Language Models

Authors: Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, Pulkit Agrawal

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments investigate whether curiosity-driven exploration generates diverse and high-quality test cases. To do so, we perform red teaming against LLMs with various red-team approaches in two tasks, text continuation and instruction following... The evaluation reveals that the proposed method (CRT) increases the coverage of the generated test cases compared to current RL-based red-teaming methods.
Researcher Affiliation | Collaboration | Zhang-Wei Hong1,2, Idan Shenfeld1,2, Tsun-Hsuan Wang2, Yung-Sung Chuang2, Aldo Pareja3, James Glass2, Akash Srivastava3, Pulkit Agrawal1,2 (Improbable AI Lab1, Massachusetts Institute of Technology2, MIT-IBM Watson AI Lab3)
Pseudocode | No | The paper describes its methods through prose and mathematical equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps.
Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/curiosity_redteam
Open Datasets | Yes | We use GPT2 with 137M parameters as the target LLM p. For baselines and our method (Section 4.1), we sample the corpus in the IMDb review dataset (Maas et al., 2011)... For instruction-following tasks, we use the Alpaca dataset (Taori et al., 2023) and Databricks' Dolly-15K dataset (Conover et al., 2023)...
Dataset Splits | No | The paper mentions training data but does not provide specific percentages or counts for training, validation, and test splits, nor does it refer to standard predefined splits that include validation for the datasets mentioned.
Hardware Specification | No | The paper acknowledges general computing resources ('MIT Supercloud' and the 'Lincoln Laboratory Supercomputing Center for providing HPC resources') and states the model size (e.g., a 'GPT2 model with 137M parameters'), but it does not specify concrete hardware details such as exact GPU/CPU models, processor types, or memory amounts used for its experiments.
Software Dependencies | Yes | We use the GPT2 (Radford et al., 2019) model with 137M parameters... train the red team model π using proximal policy optimization (PPO) (Schulman et al., 2017b) implemented in trlx (Castricato et al., 2023)... We utilize the RoBERTa hate speech classifier (Vidgen et al., 2021), with the checkpoint facebook/roberta-hate-speech-dynabench-r4-target hosted on Hugging Face, to predict the toxicity probability of target LLM responses. For Section 4.4, we use the RoBERTa toxicity classifier (tomh/toxigen_roberta) trained on the ToxiGen dataset (Hartvigsen et al., 2022).
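The hate-speech classifier named above is used to turn a target-LLM response into a scalar toxicity probability. Assuming the usual two-class (nothate/hate) classification head, that conversion is just the softmax mass on the hate class. A minimal stdlib sketch of this step, with made-up logits standing in for the classifier's real output (actually loading facebook/roberta-hate-speech-dynabench-r4-target via Hugging Face transformers is omitted to keep the snippet self-contained):

```python
import math

def toxicity_probability(logits, hate_index=1):
    """Convert raw two-class classifier logits into a toxicity
    probability via softmax. `hate_index` marks the hate/toxic
    class; the logit values below are hypothetical stand-ins for
    the RoBERTa classifier's output."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[hate_index] / sum(exps)

# Hypothetical logits for [nothate, hate]:
score = toxicity_probability([2.0, -1.0])  # ≈ 0.047, i.e. low toxicity
```

In the paper's RL setup this probability serves as the red-team policy's task reward: higher toxicity of the target's response means higher reward for the prompt that elicited it.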
Experiment Setup | Yes | The hyperparameters are listed in Tables 4 and 5. The length of training is determined by the first method reaching average rewards > 0.9 for 10 consecutive epochs... We use 1000 epochs for text-continuation tasks and 300 epochs for instruction-following tasks. For our curiosity-driven exploration method, we set the weight of the SelfBLEU reward (B_SelfBLEU) to λ_B = 1.0, the embedding cosine-similarity reward (B_Cos) to λ_C = 1.0, and the entropy bonus to λ_E = 0.01.
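The curiosity bonus those weights control can be sketched in plain Python: the red-team policy's toxicity reward is augmented with a SelfBLEU term and an embedding cosine-similarity term that penalize overlap with previously generated test cases, scaled by λ_B and λ_C. This is a simplified illustration (a single n-gram order with no brevity penalty, and toy embedding vectors), not the paper's exact implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def self_bleu(candidate, history, n=2):
    """Clipped n-gram precision of `candidate` against all past test
    cases pooled as references: a stripped-down SelfBLEU."""
    cand = Counter(ngrams(candidate.split(), n))
    if not cand or not history:
        return 0.0
    ref = Counter()
    for past in history:
        for g, c in Counter(ngrams(past.split(), n)).items():
            ref[g] = max(ref[g], c)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def curiosity_bonus(candidate, history, cand_emb, hist_embs,
                    lam_b=1.0, lam_c=1.0):
    """Novelty bonus added to the toxicity reward: negated SelfBLEU
    and negated max cosine similarity to past prompts, weighted by
    λ_B and λ_C (both 1.0 in the paper's setup)."""
    b_selfbleu = -self_bleu(candidate, history)
    b_cos = -max((cosine(cand_emb, e) for e in hist_embs), default=0.0)
    return lam_b * b_selfbleu + lam_c * b_cos
```

A prompt that repeats earlier test cases verbatim (identical n-grams and embedding) receives the maximal penalty of -(λ_B + λ_C) = -2.0, while a genuinely novel prompt is penalized little, which is what pushes the PPO-trained red-team model toward broader coverage. The entropy bonus (λ_E = 0.01) is applied inside the PPO objective and is not shown here.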