reproducibilityindex.ai

Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

Authors: Haibo Jin, Andy Zhou, Joe Menke, Haohan Wang

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (~×19.88) and lower filtered-out rates (~×1/6) than baselines.
Researcher Affiliation	Academia	Haibo Jin, School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL 61820, haibo@illinois.edu; Andy Zhou, Computer Science, Lapis Labs, University of Illinois Urbana-Champaign, Champaign, IL 61820, andyz3@illinois.edu; Joe D. Menke, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, jmenke2@illinois.edu; Haohan Wang, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, haohanw@illinois.edu
Pseudocode	Yes	The pseudo-code is shown in Algorithm 1.
Open Source Code	Yes	We will publish the comprehensive results of our experiment and the jailbreaks on the web. For detailed information, please visit the following link: https://github.com/Allen-piexl/llm_moderation_attack.
Open Datasets	Yes	We use harmful texts from the Toxic Comment Classification Challenge [32] to simulate the output of LLMs... [32] Cjadams, Sorensen Jeffrey, Elliott Julia, Dixon Lucas, McDonald Mark, nithum, and Cukierski Will. Toxic comment classification challenge, 2017.
Dataset Splits	No	The paper describes fine-tuning a model and running experiments but does not explicitly provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for its main experiments.
Hardware Specification	Yes	All experiments are conducted on one Tesla A100 GPU 80G.
Software Dependencies	No	The paper mentions using "toxic-bert model [33]" and fine-tuning it, but it does not specify version numbers for this or any other core software dependencies (e.g., Python, PyTorch, TensorFlow versions) that would allow for reproducible setup.
Experiment Setup	Yes	We fine-tuned toxic-bert [33] for 80 epochs as the shadow model. We initialize the length of the cipher characters to 20 tokens and optimize for 100 steps using a batch size of 64 and a top-k of 256.
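
For anyone attempting a rerun, the settings quoted in the Hardware Specification and Experiment Setup rows can be gathered into a single configuration record. The following is a minimal sketch only: the class name `JamSetup` and every field name are hypothetical labels chosen for illustration and do not come from the paper or its repository. Only the values are taken from the rows above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class JamSetup:
    """Reported settings from the table above.

    All field names are hypothetical; only the values are quoted from the paper.
    """
    shadow_model: str = "toxic-bert"          # fine-tuned as the shadow model
    shadow_finetune_epochs: int = 80          # epochs of shadow-model fine-tuning
    cipher_init_tokens: int = 20              # initial cipher-character length, in tokens
    optimization_steps: int = 100             # optimization steps for the cipher characters
    batch_size: int = 64
    top_k: int = 256
    hardware: str = "one Tesla A100 GPU 80G"  # as stated in the Hardware Specification row


setup = JamSetup()
config = asdict(setup)  # plain dict, convenient for logging alongside results

for name, value in config.items():
    print(f"{name}: {value}")
```

Software versions are deliberately absent from this record because the paper does not report them, so they would need to be logged separately at run time.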