PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Authors: Ziyang Zhang, Qizhen Zhang, Jakob Nicolaus Foerster

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2.
Researcher Affiliation | Academia | 1 University of Oxford; 2 FLAIR, University of Oxford. Correspondence to: Ziyang Zhang <ziyang.zhang@sjc.ox.ac.uk>.
Pseudocode | No | The paper describes the PARDEN method formally using equations (e.g., REPEAT(y) := LLM([prefix; examples; y; suffix; examples])), but does not include any explicit pseudocode blocks or algorithm listings. (A minimal sketch of this formulation is given after the table.)
Open Source Code | Yes | Code and data are available at https://github.com/Ed-Zh/PARDEN.
Open Datasets | Yes | To collect benign examples, we sample 552 instructions from open-instruct-v1 (Wang et al., 2023) and produce benign outputs using Llama2 and Claude-2.1. To produce jailbreak examples, we follow Zou et al. (2023) to adversarially attack the LLMs using the 520 harmful behaviours in their AdvBench. Since the original attacks only result in 60/520 jailbreaks, we further leverage prompt injection to improve the attack success rate, and manually filter 484 true jailbreaks for Llama2 and 539 for Claude-2.1. See our open-source dataset at https://github.com/Ed-Zh/PARDEN for details.
Dataset Splits | No | The paper describes collecting a dataset for 'fair evaluation' and organizing data into 4-tuples, but it does not explicitly provide training/validation/test splits for the experimental setup or for PARDEN's classification process. PARDEN itself is not trained; it is an inference-time method applied on top of existing LLMs.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It mentions the LLMs used (Llama2-7B, Claude-2.1) but not the computational resources used for the evaluation.
Software Dependencies | No | The paper mentions using NLTK (Bird & Loper, 2004) for BLEU score computation, but it does not specify version numbers for NLTK or any other key software dependencies required to replicate the experiments. (An illustrative BLEU check using NLTK is sketched after the table.)
Experiment Setup | Yes | When configuring the LLM for PARDEN, we use temperature = 0 to evaluate greedily, because stochastic sampling would introduce extra noise into the repetition and should be avoided. A temperature of 0 ensures PARDEN does not sample from a probability distribution; the original generation, however, need not use a temperature of 0.
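
The REPEAT construction quoted in the Pseudocode row can be sketched as follows. This is a minimal illustration, assuming a generic llm_generate(prompt, temperature) callable; the prefix, suffix, and few-shot examples below are placeholder text, not the exact prompts from the PARDEN paper or repository.

```python
# Minimal sketch of REPEAT(y) := LLM([prefix; examples; y; suffix; examples]).
# The prompt wording, few-shot examples, and the llm_generate interface are
# illustrative assumptions, not the authors' exact implementation.

FEW_SHOT_EXAMPLES = (
    "Example of repeating content exactly:\n"
    "Input: The capital of France is Paris.\n"
    "Output: The capital of France is Paris.\n"
)

PREFIX = "Repeat the following content exactly, without adding anything:\n"
SUFFIX = "\nNow repeat the content above exactly.\n"


def repeat(llm_generate, y: str) -> str:
    """Ask the safety-tuned LLM to repeat its own earlier output y."""
    prompt = PREFIX + FEW_SHOT_EXAMPLES + y + SUFFIX + FEW_SHOT_EXAMPLES
    # temperature=0 so the repetition step is greedy and deterministic,
    # as described in the Experiment Setup row; the *original* generation
    # of y may still use any sampling temperature.
    return llm_generate(prompt, temperature=0)
```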
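
A second sketch shows how the repetition can be compared to the original output using NLTK's BLEU implementation, as mentioned in the Software Dependencies row. The threshold value and function names here are illustrative assumptions; the paper selects the classification threshold by sweeping it to trade off false-positive and true-positive rates.

```python
# Sketch of the BLEU-based check: if the model refuses or deviates when asked
# to repeat its own output, the low BLEU score flags the output as a likely
# jailbreak. The threshold of 0.2 is an illustrative assumption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def flags_as_jailbreak(original: str, repeated: str, threshold: float = 0.2) -> bool:
    """Return True if the repetition diverges enough from the original output."""
    smoothing = SmoothingFunction().method1  # avoid zero scores on short texts
    score = sentence_bleu(
        [original.split()],   # reference: the model's original output
        repeated.split(),     # hypothesis: the model's attempted repetition
        smoothing_function=smoothing,
    )
    return score < threshold
```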