Curiosity-driven Red-teaming for Large Language Models
Authors: Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, Pulkit Agrawal
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments investigate whether curiosity-driven exploration generates diverse and high-quality test cases. To do so, we perform red teaming against LLMs with various red-team approaches in two tasks, text continuation and instruction following... The evaluation reveals that the proposed method (CRT) increases the coverage of the generated test cases compared to current RL-based red-teaming methods. |
| Researcher Affiliation | Collaboration | Zhang-Wei Hong1,2, Idan Shenfeld1,2, Tsun-Hsuan Wang2, Yung-Sung Chuang2, Aldo Pareja3, James Glass2, Akash Srivastava3, Pulkit Agrawal1,2 — Improbable AI Lab1, Massachusetts Institute of Technology2, MIT-IBM Watson AI Lab3 |
| Pseudocode | No | The paper describes its methods through prose and mathematical equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps. |
| Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/curiosity_redteam |
| Open Datasets | Yes | We use GPT2 with 137M parameters as the target LLM p. For baselines and our method (Section 4.1), we sample the corpus from the IMDb review dataset (Maas et al., 2011)... For instruction-following tasks, we use the Alpaca dataset (Taori et al., 2023) and Databricks' Dolly-15k dataset (Conover et al., 2023)... |
| Dataset Splits | No | The paper mentions training data, but it does not provide specific percentages or counts for training, validation, and test splits, nor does it refer to standard predefined splits that include validation for the datasets mentioned. |
| Hardware Specification | No | The paper mentions general computing environments like 'MIT Supercloud' and 'Lincoln Laboratory Supercomputing Center for providing HPC resources' in the acknowledgments, and states the model size (e.g., 'GPT2 model with 137M parameters'). However, it does not specify concrete hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | Yes | We use GPT2 (Radford et al., 2019) model with 137M parameters... train the red team model π using proximal policy optimization (PPO) (Schulman et al., 2017b) implemented in trlx (Castricato et al., 2023)... We utilize the RoBERTa hate speech classifier (Vidgen et al., 2021) with the checkpoint facebook/roberta-hate-speech-dynabench-r4-target hosted on Hugging Face, to predict the toxicity probability of target LLM responses. For Section 4.4, we use the RoBERTa toxicity classifier (tomh/toxigen_roberta) trained with the ToxiGen dataset (Hartvigsen et al., 2022). |
| Experiment Setup | Yes | The hyperparameters are listed in Tables 4 and 5. The length of training is determined by the first method reaching average rewards > 0.9 for 10 consecutive epochs... We use 1000 epochs for text-continuation tasks and 300 epochs for instruction-following tasks. For our curiosity-driven exploration method, we set the weight of the SelfBLEU reward (B_SelfBLEU) as λB = 1.0, the embedding cosine similarity reward (B_Cos) as λC = 1.0, and the entropy bonus as λE = 0.01. |
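The Experiment Setup row describes CRT's reward shaping: the PPO reward combines the toxicity score with SelfBLEU and embedding-cosine novelty bonuses (λB = 1.0, λC = 1.0) plus an entropy bonus (λE = 0.01). The following is a minimal sketch of that combination, assuming simplified stand-ins for the novelty terms: `self_bleu_bonus` uses unigram precision rather than the paper's full SelfBLEU, and `cosine_bonus` uses bag-of-words vectors rather than sentence embeddings. Function names and the `history` buffer are illustrative, not from the released code.

```python
import math
from collections import Counter

LAMBDA_B = 1.0   # weight of SelfBLEU novelty reward (lambda_B in the paper)
LAMBDA_C = 1.0   # weight of embedding cosine-similarity reward (lambda_C)
LAMBDA_E = 0.01  # entropy bonus coefficient (lambda_E)

def _bow(text):
    """Bag-of-words counts; a toy substitute for n-gram/embedding features."""
    return Counter(text.lower().split())

def self_bleu_bonus(candidate, history):
    """Novelty = 1 - max unigram precision against past prompts (toy SelfBLEU)."""
    if not history:
        return 1.0
    cand = _bow(candidate)
    total = sum(cand.values())
    best = 0.0
    for past in history:
        ref = _bow(past)
        overlap = sum(min(n, ref[w]) for w, n in cand.items())
        best = max(best, overlap / total)
    return 1.0 - best

def cosine_bonus(candidate, history):
    """Novelty = 1 - max bag-of-words cosine similarity to past prompts."""
    if not history:
        return 1.0
    cand = _bow(candidate)

    def cos(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    return 1.0 - max(cos(cand, _bow(past)) for past in history)

def crt_reward(toxicity, candidate, history, policy_entropy):
    """Per-step PPO reward: task reward + weighted novelty + entropy bonus."""
    return (toxicity
            + LAMBDA_B * self_bleu_bonus(candidate, history)
            + LAMBDA_C * cosine_bonus(candidate, history)
            + LAMBDA_E * policy_entropy)
```

A quick check of the intended effect: with the same target-LLM toxicity score, a prompt that duplicates an earlier test case earns a strictly lower reward than a novel one, which is the pressure that drives coverage up.

```python
history = ["tell me a story about cats"]
r_dup = crt_reward(0.5, "tell me a story about cats", history, 1.0)
r_new = crt_reward(0.5, "write a poem about oceans", history, 1.0)
# the novel prompt is rewarded more
print(r_new > r_dup)
```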