Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Authors: Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of RAINBOW TEAMING through extensive experiments targeting several state-of-the-art LLMs fine-tuned on safety-aligned data, including the Llama 2-chat [65] and Llama 3-Instruct [1] models. |
| Researcher Affiliation | Collaboration | Mikayel Samvelyan (1,2), Sharath Chandra Raparthy (1), Andrei Lupu (1,3), Eric Hambro (1), Aram H. Markosyan (1), Manish Bhatt (1), Yuning Mao (1), Minqi Jiang (1,2), Jack Parker-Holder (2), Jakob Foerster (1,3), Tim Rocktäschel (2), Roberta Raileanu (1,2). Affiliations: (1) Meta, (2) University College London, (3) University of Oxford. |
| Pseudocode | Yes | "Algorithm 1 in Appendix B provides the pseudocode of this method." and "Algorithm 2 in Appendix B provides the pseudocode of our method." |
| Open Source Code | No | While we have not open-sourced our code or our synthetic data alongside our paper, we are assessing the safety and legal concerns of doing so at a future date. |
| Open Datasets | Yes | Table 11: List of hyperparameters used in question answering experiments. RAINBOW TEAMING: Number of Initial Examples 256, Dataset of Initial Examples Trivia QA [30]. |
| Dataset Splits | Yes | We perform a 12/3 train-test split and use Llama 2-chat 70B with a handcrafted system prompt to generate safe refusal prompts for the train set. |
| Hardware Specification | Yes | We conducted our experiments on a cluster of A100 GPUs, with access ranging from 128 to 256 GPUs throughout the project. |
| Software Dependencies | No | The paper mentions various LLMs and tools used (e.g., "Llama 2-chat 7B", "Llama 3-Instruct 8B", "GPT-4", "Llama Guard"), but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 10: List of hyperparameters used in safety experiments. RAINBOW TEAMING: Number of Initial Examples 0, Batch Size 32, Iterations 2000, BLEU Similarity Filter 0.6, Archive Sampling Temperature 0.1, Archive Size 100. Generator Parameters: Temperature 0.7, Top-k 0.95, Maximum Tokens 256. Learning Rate 2e-7, Batch Size 32, Learning Rate Scheduler Constant, Sequence Length 4096. |
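
For readers who want to reuse the safety-experiment settings quoted in the Experiment Setup row, the values can be collected into a small configuration object. The following is a minimal sketch, not the authors' configuration format (the paper's code is not released). All class and field names are hypothetical. The last group of values has no group name in the row, so calling it `FineTuningConfig` is an assumption, and the learning rate is read as 2e-7 (an assumption about the sign of the exponent).

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RainbowTeamingConfig:
    """Search settings quoted from Table 10 (field names are hypothetical)."""
    num_initial_examples: int = 0
    batch_size: int = 32
    iterations: int = 2000
    bleu_similarity_filter: float = 0.6
    archive_sampling_temperature: float = 0.1
    archive_size: int = 100


@dataclass(frozen=True)
class GeneratorConfig:
    """Generator sampling settings quoted from Table 10."""
    temperature: float = 0.7
    top_k: float = 0.95  # reported under the label "Top-k" in the row
    max_tokens: int = 256


@dataclass(frozen=True)
class FineTuningConfig:
    """Remaining values from the row; the group name is an assumption."""
    learning_rate: float = 2e-7  # exponent sign assumed to be negative
    batch_size: int = 32
    lr_scheduler: str = "constant"
    sequence_length: int = 4096


@dataclass(frozen=True)
class SafetyExperimentConfig:
    """Bundles the three groups so they can be logged or serialized together."""
    rainbow_teaming: RainbowTeamingConfig = field(default_factory=RainbowTeamingConfig)
    generator: GeneratorConfig = field(default_factory=GeneratorConfig)
    fine_tuning: FineTuningConfig = field(default_factory=FineTuningConfig)


if __name__ == "__main__":
    # Print the full configuration as a nested dictionary.
    print(asdict(SafetyExperimentConfig()))
```

Keeping the groups as separate frozen dataclasses mirrors the grouping in the quoted table and prevents accidental mutation of the reported values.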