PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Authors: Ziyang Zhang, Qizhen Zhang, Jakob Nicolaus Foerster

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2.
Researcher Affiliation | Academia | 1 University of Oxford; 2 FLAIR, University of Oxford. Correspondence to: Ziyang Zhang <ziyang.zhang@sjc.ox.ac.uk>.
Pseudocode | No | The paper describes the PARDEN method formally using equations (e.g., REPEAT(y) := LLM([prefix; examples; y; suffix; examples])), but does not include any explicit pseudocode blocks or algorithm listings. (A minimal sketch of this formulation is given after the table.)
Open Source Code | Yes | Code and data are available at https://github.com/Ed-Zh/PARDEN.
Open Datasets | Yes | To collect benign examples, we sample 552 instructions from open-instruct-v1 (Wang et al., 2023) and produce benign outputs using Llama2 and Claude-2.1. To produce jailbreak examples, we follow Zou et al. (2023) to adversarially attack the LLMs using the 520 harmful behaviours in their AdvBench. Since the original attacks only result in 60/520 jailbreaks, we further leverage prompt injection to improve the attack success rate, and manually filter 484 true jailbreaks for Llama2 and 539 for Claude-2.1. See our open-source dataset at https://github.com/Ed-Zh/PARDEN for details.
Dataset Splits | No | The paper describes collecting a dataset for 'fair evaluation' and organizing data into 4-tuples, but it does not explicitly provide training/validation/test splits for the experimental setup or for PARDEN's classification process. PARDEN itself is not trained; it is an inference-time method applied on top of existing LLMs.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It mentions the LLMs used (Llama2-7B, Claude-2.1) but not the computational resources used for the evaluation.
Software Dependencies | No | The paper mentions using NLTK (Bird & Loper, 2004) for BLEU score computation, but it does not specify version numbers for NLTK or any other key software dependencies required to replicate the experiments. (An illustrative BLEU check using NLTK is sketched after the table.)
Experiment Setup | Yes | When configuring the LLM for PARDEN, we use temperature = 0 to evaluate greedily, because stochastic sampling would introduce extra noise into the repetition and should be avoided. A temperature of 0 ensures PARDEN does not sample from a probability distribution; the original generation, however, need not use a temperature of 0.
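
The REPEAT construction quoted in the Pseudocode row can be sketched as follows. This is a minimal illustration, assuming a generic llm_generate(prompt, temperature) callable; the prefix, suffix, and few-shot examples below are placeholder text, not the exact prompts from the PARDEN paper or repository.

```python
# Minimal sketch of REPEAT(y) := LLM([prefix; examples; y; suffix; examples]).
# The prompt wording, few-shot examples, and the llm_generate interface are
# illustrative assumptions, not the authors' exact implementation.

FEW_SHOT_EXAMPLES = (
    "Example of repeating content exactly:\n"
    "Input: The capital of France is Paris.\n"
    "Output: The capital of France is Paris.\n"
)

PREFIX = "Repeat the following content exactly, without adding anything:\n"
SUFFIX = "\nNow repeat the content above exactly.\n"


def repeat(llm_generate, y: str) -> str:
    """Ask the safety-tuned LLM to repeat its own earlier output y."""
    prompt = PREFIX + FEW_SHOT_EXAMPLES + y + SUFFIX + FEW_SHOT_EXAMPLES
    # temperature=0 so the repetition step is greedy and deterministic,
    # as described in the Experiment Setup row; the *original* generation
    # of y may still use any sampling temperature.
    return llm_generate(prompt, temperature=0)
```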
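
A second sketch shows how the repetition can be compared to the original output using NLTK's BLEU implementation, as mentioned in the Software Dependencies row. The threshold value and function names here are illustrative assumptions; the paper selects the classification threshold by sweeping it to trade off false-positive and true-positive rates.

```python
# Sketch of the BLEU-based check: if the model refuses or deviates when asked
# to repeat its own output, the low BLEU score flags the output as a likely
# jailbreak. The threshold of 0.2 is an illustrative assumption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def flags_as_jailbreak(original: str, repeated: str, threshold: float = 0.2) -> bool:
    """Return True if the repetition diverges enough from the original output."""
    smoothing = SmoothingFunction().method1  # avoid zero scores on short texts
    score = sentence_bleu(
        [original.split()],   # reference: the model's original output
        repeated.split(),     # hypothesis: the model's attempted repetition
        smoothing_function=smoothing,
    )
    return score < threshold
```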