Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Authors: Haibo Jin, Andy Zhou, Joe Menke, Haohan Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (~19.88) and lower filtered-out rates (~1/6) than baselines. |
| Researcher Affiliation | Academia | Haibo Jin, School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL 61820, haibo@illinois.edu; Andy Zhou, Computer Science, Lapis Labs, University of Illinois Urbana-Champaign, Champaign, IL 61820, andyz3@illinois.edu; Joe D. Menke, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, jmenke2@illinois.edu; Haohan Wang, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, haohanw@illinois.edu |
| Pseudocode | Yes | The pseudo-code is shown in Algorithm 1. |
| Open Source Code | Yes | We will publish the comprehensive results of our experiment and the jailbreaks on the web. For detailed information, please visit the following link: https://github.com/Allen-piexl/llm_moderation_attack. |
| Open Datasets | Yes | We use harmful texts from the Toxic Comment Classification Challenge [32] to simulate the output of LLMs...[32] Cjadams, Sorensen Jeffrey, Elliott Julia, Dixon Lucas, Mc Donald Mark, nithum, and Cukierski Will. Toxic comment classification challenge, 2017. |
| Dataset Splits | No | The paper describes fine-tuning a model and running experiments but does not explicitly provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for its main experiments. |
| Hardware Specification | Yes | All experiments are conducted on one Tesla A100 GPU 80G. |
| Software Dependencies | No | The paper mentions using "toxic-bert model [33]" and fine-tuning it, but it does not specify version numbers for this or any other core software dependencies (e.g., Python, PyTorch, TensorFlow versions) that would allow for reproducible setup. |
| Experiment Setup | Yes | We fine-tuned toxic-bert [33] for 80 epochs as the shadow model. We initialize the length of cipher characters with 20 tokens, and optimize for 100 steps using a batch size of 64 and a top-k of 256. |
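The reported setup can be summarized as a hedged configuration sketch. The key names and helper function below are hypothetical (not from the paper); only the numeric values and the toxic-bert shadow-model choice come from the experiment-setup description above.

```python
# Hypothetical configuration capturing the paper's reported JAM setup.
# Key names are illustrative; values are taken from the Experiment Setup row.
JAM_CONFIG = {
    "shadow_model": "toxic-bert",   # fine-tuned as the shadow model
    "finetune_epochs": 80,          # fine-tuning epochs for the shadow model
    "cipher_init_tokens": 20,       # initial length of cipher characters
    "optimize_steps": 100,          # optimization steps for the cipher
    "batch_size": 64,
    "top_k": 256,
}


def describe(cfg: dict) -> str:
    """Render the configuration as a one-line summary string."""
    return ", ".join(f"{k}={v}" for k, v in cfg.items())


if __name__ == "__main__":
    print(describe(JAM_CONFIG))
```

This is only a record of the hyperparameters; the actual optimization loop (and how the cipher tokens are updated against the shadow model) is specified in the paper's Algorithm 1.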