Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Authors: Haibo Jin, Andy Zhou, Joe Menke, Haohan Wang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (~×19.88) and lower filtered-out rates (~×1/6) than baselines. |
| Researcher Affiliation | Academia | Haibo Jin, School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL 61820, EMAIL; Andy Zhou, Computer Science, Lapis Labs, University of Illinois Urbana-Champaign, Champaign, IL 61820, EMAIL; Joe D. Menke, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, EMAIL; Haohan Wang, School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, EMAIL |
| Pseudocode | Yes | The pseudo-code is shown in the Algorithm 1. |
| Open Source Code | Yes | We will publish the comprehensive results of our experiment and the jailbreaks on the web. For detailed information, please visit the following link: https://github.com/Allen-piexl/llm_moderation_attack. |
| Open Datasets | Yes | We use harmful texts from the Toxic Comment Classification Challenge [32] to simulate the output of LLMs... [32] Cjadams, Sorensen Jeffrey, Elliott Julia, Dixon Lucas, McDonald Mark, nithum, and Cukierski Will. Toxic comment classification challenge, 2017. |
| Dataset Splits | No | The paper describes fine-tuning a model and running experiments but does not explicitly provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for its main experiments. |
| Hardware Specification | Yes | All experiments are conducted on one Tesla A100 GPU 80G. |
| Software Dependencies | No | The paper mentions using "toxic-bert model [33]" and fine-tuning it, but it does not specify version numbers for this or any other core software dependencies (e.g., Python, PyTorch, TensorFlow versions) that would allow for reproducible setup. |
| Experiment Setup | Yes | We fine-tuned toxic-bert [33] using 80 epochs as the shadow model. We initial the length of cipher characters with 20 tokens, and optimize for 100 steps using a batch size of 64, top-k of 256. |
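
For anyone attempting a reproduction, the values quoted in the Hardware Specification and Experiment Setup rows can be gathered into a single configuration object. The following is a minimal sketch only: the class name and every field name are hypothetical and do not come from the paper or its repository, and unreported settings (such as learning rate or software versions) are deliberately left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""

    shadow_model: str = "toxic-bert"      # fine-tuned as the shadow model
    shadow_finetune_epochs: int = 80      # "using 80 epochs"
    cipher_init_length_tokens: int = 20   # initial cipher-character length
    optimization_steps: int = 100         # "optimize for 100 steps"
    batch_size: int = 64
    top_k: int = 256
    hardware: str = "one Tesla A100 GPU 80G"  # Hardware Specification row


setup = ReportedSetup()

# Print each reported value so it can be logged alongside a reproduction run.
for name, value in asdict(setup).items():
    print(f"{name}: {value}")
```

Freezing the dataclass keeps the reported values from being changed by accident during a run; anything the paper does not state would have to be chosen and documented separately.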