Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Authors: Jingtong Su, Julia Kempe, Karen Ullrich
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective. Under our framework, we first show that pretrained LLMs will mimic harmful behaviour if it is present in the training corpus. Under that same framework, we then introduce a statistical notion of alignment and lower-bound the jailbreaking probability, showing that jailbreaking is unpreventable under reasonable assumptions. Based on our insights, we propose an alteration to the currently prevalent alignment strategy, RLHF. Specifically, we introduce a simple modification to the RLHF objective, which we call E-RLHF, that aims to increase the likelihood of safe responses (see the objective sketch after this table). E-RLHF brings no additional training cost and is compatible with other methods. Empirically, we demonstrate that E-RLHF outperforms RLHF on all alignment problems put forward by the AdvBench [1] and HarmBench [2] projects without sacrificing model performance as measured by the MT-Bench project [3]. |
| Researcher Affiliation | Collaboration | Jingtong Su (NYU & Meta AI, FAIR); Julia Kempe (NYU & Meta AI, FAIR); Karen Ullrich (Meta AI, FAIR) |
| Pseudocode | No | The paper describes methods through prose and mathematical equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | We do not use our own model or data for training. Thus, we do not provide any data or code ourselves; all models and data we use are publicly available. |
| Open Datasets | Yes | We tune the publicly available SFT model p^SFT provided by the Hugging Face hub [51], using the public dataset [52, 53], with the default hyperparameter setup. |
| Dataset Splits | No | The paper mentions using a 'public dataset' and a 'preference dataset' but does not explicitly provide percentages or counts for training, validation, or test splits. It refers to a 'default hyperparameter setup', which might imply standard splits, but these are not specified in the text. |
| Hardware Specification | Yes | Experiments are performed on 8 NVIDIA Tesla V100 GPUs, using half-precision (i.e., float16) tuning. |
| Software Dependencies | No | The paper mentions using the 'alignment-handbook' code base and the Hugging Face hub but does not specify exact version numbers for these or other software dependencies. |
| Experiment Setup | No | The paper states that it uses the 'default hyperparameter setup' and specifies greedy decoding (i.e., T = 0) for model evaluation (a decoding sketch follows this table), but it does not provide a comprehensive list of training hyperparameters such as learning rates, batch sizes, or number of epochs. |
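
To make the E-RLHF description in the Research Type row concrete, here is a minimal sketch of the modification as we read it from the response above. The notation is our reconstruction, not a verbatim excerpt from the paper: π_θ is the policy being tuned, π^SFT the reference SFT model, r the reward model, β the KL weight, and x_safe a safety-transformed version of a harmful prompt x.

```latex
% Standard RLHF objective (sketch): reward maximization with a KL anchor
% keeping the tuned policy close to the SFT model on the same prompt x.
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[\, r(x, y) \,\big]
  \;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi^{\mathrm{SFT}}(\cdot \mid x) \big)

% E-RLHF as described above (assumed form): for a harmful prompt x, the KL
% term anchors instead to the SFT distribution conditioned on a safe prompt
% x_safe, pushing probability mass toward safe responses.
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[\, r(x, y) \,\big]
  \;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi^{\mathrm{SFT}}(\cdot \mid x_{\mathrm{safe}}) \big)
```

Only the conditioning prompt in the KL term changes, which is consistent with the response's claim that E-RLHF brings no additional training cost.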
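The Experiment Setup row pins down only the decoding side of evaluation: greedy decoding (the T = 0 limit of sampling) with a float16 model. Below is a minimal sketch of that setup using the Hugging Face `transformers` API; the model name is a placeholder, since the exact checkpoint is not identified in this table.

```python
# Sketch of the evaluation-time setup described above: fp16 model weights
# and greedy decoding. The model name is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-sft-model"  # placeholder; the paper tunes a public SFT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half-precision, as reported in the table
)

prompt = "How do I bake bread?"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=False selects the argmax token at every step, i.e. greedy
# decoding, equivalent to temperature T = 0.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```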