Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Authors: Jingtong Su, Julia Kempe, Karen Ullrich
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if it is present in the training corpus. Under that same framework, we then introduce a statistical notion of alignment and lower-bound the jailbreaking probability, showing that jailbreaking is unpreventable under reasonable assumptions. Based on our insights, we propose an alteration to the currently prevalent alignment strategy, RLHF. Specifically, we introduce a simple modification to the RLHF objective, which we call E-RLHF, that aims to increase the likelihood of safe responses (see the objective sketch after this table). E-RLHF brings no additional training cost and is compatible with other methods. Empirically, we demonstrate that E-RLHF outperforms RLHF on all alignment problems put forward by the AdvBench [1] and HarmBench [2] projects without sacrificing model performance as measured by the MT-Bench project [3]. |
| Researcher Affiliation | Collaboration | Jingtong Su (NYU & Meta AI, FAIR); Julia Kempe (NYU & Meta AI, FAIR); Karen Ullrich (Meta AI, FAIR) |
| Pseudocode | No | The paper describes methods through prose and mathematical equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | We do not use our own model or data for training. Thus, we do not provide any data or code ourselves; all models and data we use are publicly available. |
| Open Datasets | Yes | We tune the publicly available SFT model p^SFT provided by the Hugging Face hub [51], using the public dataset [52, 53], with the default hyperparameter setup. |
| Dataset Splits | No | The paper mentions using a 'public dataset' and a 'preference dataset' but does not explicitly provide percentages or counts for training, validation, or test splits. It refers to a 'default hyperparameter setup', which might imply standard splits, but these are not specified in the text. |
| Hardware Specification | Yes | Experiments are performed on 8 NVIDIA Tesla V100 GPUs, using half-precision (i.e., float16) tuning. |
| Software Dependencies | No | The paper mentions using the 'alignment-handbook' code base and the Hugging Face hub but does not specify exact version numbers for these or other software dependencies. |
| Experiment Setup | No | The paper states that it uses the 'default hyperparameter setup' and specifies greedy decoding (i.e., T = 0) for model evaluation (a decoding sketch follows this table), but it does not provide a comprehensive list of training hyperparameters such as learning rates, batch sizes, or number of epochs. |
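
To make the E-RLHF description in the Research Type row concrete, here is a minimal sketch of the modification as we read it from the response above. The notation is our reconstruction, not a verbatim excerpt from the paper: π_θ is the policy being tuned, π^SFT the reference SFT model, r the reward model, β the KL weight, and x_safe a safety-transformed version of a harmful prompt x.

```latex
% Standard RLHF objective (sketch): reward maximization with a KL anchor
% keeping the tuned policy close to the SFT model on the same prompt x.
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[\, r(x, y) \,\big]
  \;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi^{\mathrm{SFT}}(\cdot \mid x) \big)

% E-RLHF as described above (assumed form): for a harmful prompt x, the KL
% term anchors instead to the SFT distribution conditioned on a safe prompt
% x_safe, pushing probability mass toward safe responses.
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[\, r(x, y) \,\big]
  \;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi^{\mathrm{SFT}}(\cdot \mid x_{\mathrm{safe}}) \big)
```

Only the conditioning prompt in the KL term changes, which is consistent with the response's claim that E-RLHF brings no additional training cost.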
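The Experiment Setup row pins down only the decoding side of evaluation: greedy decoding (the T = 0 limit of sampling) with a float16 model. Below is a minimal sketch of that setup using the Hugging Face `transformers` API; the model name is a placeholder, since the exact checkpoint is not identified in this table.

```python
# Sketch of the evaluation-time setup described above: fp16 model weights
# and greedy decoding. The model name is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-sft-model"  # placeholder; the paper tunes a public SFT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half-precision, as reported in the table
)

prompt = "How do I bake bread?"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=False selects the argmax token at every step, i.e. greedy
# decoding, equivalent to temperature T = 0.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```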