Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

Authors: Haibo Jin, Andy Zhou, Joe Menke, Haohan Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (×19.88) and lower filtered-out rates (×1/6) than baselines.
Researcher Affiliation | Academia | Haibo Jin, School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL 61820, haibo@illinois.edu; Andy Zhou, Computer Science, Lapis Labs, University of Illinois Urbana-Champaign, Champaign, IL 61820, andyz3@illinois.edu; Joe D. Menke, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, jmenke2@illinois.edu; Haohan Wang, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, haohanw@illinois.edu
Pseudocode | Yes | The pseudo-code is shown in Algorithm 1.
Open Source Code | Yes | We will publish the comprehensive results of our experiment and the jailbreaks on the web. For detailed information, please visit the following link: https://github.com/Allen-piexl/llm_moderation_attack.
Open Datasets | Yes | We use harmful texts from the Toxic Comment Classification Challenge [32] to simulate the output of LLMs... [32] cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum, and Will Cukierski. Toxic Comment Classification Challenge, 2017. (A loading sketch for this dataset appears after the table.)
Dataset Splits | No | The paper describes fine-tuning a model and running experiments but does not explicitly provide train/validation/test dataset splits (e.g., percentages or sample counts) for its main experiments.
Hardware Specification | Yes | All experiments are conducted on one Tesla A100 GPU (80 GB).
Software Dependencies | No | The paper mentions using and fine-tuning the toxic-bert model [33], but it does not specify version numbers for this or any other core software dependency (e.g., Python, PyTorch, or TensorFlow versions) that would allow a reproducible setup.
Experiment Setup | Yes | We fine-tuned toxic-bert [33] for 80 epochs as the shadow model. We initialize the length of the cipher characters to 20 tokens and optimize for 100 steps with a batch size of 64 and a top-k of 256. (A fine-tuning sketch for the shadow model appears after the table.)
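
As a concrete illustration of the dataset usage noted in the Open Datasets row, the sketch below loads the Kaggle Toxic Comment Classification Challenge training file and collects the comments flagged under any of its six toxicity labels, one plausible way to obtain the harmful texts the paper uses to simulate LLM outputs. The file path and the any-label selection rule are assumptions; the paper does not describe its exact preprocessing.

```python
# Minimal sketch (not from the paper): load the Jigsaw Toxic Comment
# Classification Challenge training split and keep comments carrying any
# of the six toxicity labels, to stand in for harmful LLM outputs.
import pandas as pd

LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene",
                 "threat", "insult", "identity_hate"]

# "train.csv" is the standard Kaggle release; the path is illustrative.
df = pd.read_csv("train.csv")
harmful_texts = df.loc[df[LABEL_COLUMNS].any(axis=1), "comment_text"].tolist()
print(f"Collected {len(harmful_texts)} harmful texts to simulate LLM outputs.")
```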
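
The Experiment Setup row fixes only a few hyperparameters (80 epochs for the shadow model; 20 cipher tokens optimized for 100 steps with batch size 64 and top-k 256). The sketch below shows one way to fine-tune toxic-bert as a multi-label shadow classifier for 80 epochs; the Hugging Face checkpoint name unitary/toxic-bert, the toy data, the learning rate, and the loss choice are assumptions, since the paper does not spell them out.

```python
# Minimal sketch, assuming the public unitary/toxic-bert checkpoint (6 toxicity
# labels) and toy training data; only the 80-epoch count comes from the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unitary/toxic-bert")
model = AutoModelForSequenceClassification.from_pretrained("unitary/toxic-bert")

texts = ["you are wonderful", "I will hurt you"]            # toy stand-ins
labels = torch.tensor([[0, 0, 0, 0, 0, 0],                   # benign
                       [1, 0, 0, 1, 0, 0]], dtype=torch.float)  # toxic + threat

enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr is an assumption
loss_fn = torch.nn.BCEWithLogitsLoss()                      # multi-label objective

model.train()
for epoch in range(80):                                     # 80 epochs per the paper
    optimizer.zero_grad()
    logits = model(**enc).logits
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
```

The remaining hyperparameters (20 cipher tokens, 100 optimization steps, batch size 64, top-k 256) describe the cipher-character search itself, which this sketch does not attempt to reproduce.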