Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Authors: Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we evaluate our I-GCG on a series of benchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve a nearly 100% attack success rate. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Sea AI Lab, Singapore 3University of Oxford, United Kingdom 4School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, China |
| Pseudocode | Yes | A ALGORITHM OF THE PROPOSED METHOD In this paper, we propose several improved techniques to improve the jailbreak performance of the optimization-based jailbreak method. Combining the proposed techniques, we develop an efficient jailbreak method, dobbed I-GCG. The algorithm of the proposed I-GCG is shown in Algorithm 1. Algorithm 1: I-GCG |
| Open Source Code | Yes | The code is released at https://github.com/jiaxiaojunQAQ/I-GCG. |
| Open Datasets | Yes | Datasets. We use the harmful behaviors subset from the Adv Bench benchmark (Zou et al., 2023) to evaluate the jailbreak performance of the proposed I-GCG. Specifically, the Adv Bench consists of 520 objectives that request harmful content, such as abusive language, violent content, misinformation, illegal activities, and so on. Following previous works (Chao et al., 2023; Li et al., 2023; Wei et al., 2024), we eliminate duplicate harmful requests from the Adv Bench dataset and select 50 representative harmful requests to compare performance. We also adopt Harm Bench (Mazeika et al., 2024), which is used in the NeurIPS 2023 Red Teaming Track, to evaluate the proposed I-GCG (Base Model Subtrack). |
| Dataset Splits | No | The paper mentions using the Adv Bench benchmark and Harm Bench, and selecting 50 representative harmful requests from Adv Bench for comparison. However, it does not provide explicit details about how these datasets were split into training, validation, or test sets for their experiments, or specific percentages/counts for such splits. |
| Hardware Specification | Yes | Experiments are conducted on an Ubuntu system with an NVIDIA A100 Tensor Core GPU and 80GB of RAM. |
| Software Dependencies | No | The paper mentions using various LLMs (e.g., VICUNA-7B-1.5, GUANACO-7B, LLAMA2-7B-CHAT, MISTRAL-7B-INSTRUCT-0.2) and a "Chat GPT-3.5-based check", but does not specify version numbers for any underlying software libraries, programming languages, or development environments used for their implementation. |
| Experiment Setup | Yes | For any malicious question Q, we set the target response as Sure, + My output is harmful. Here is + Rephrase(Q) . Following previous works (Zou et al., 2023; Zhang & Wei, 2024), the length of the jailbreak suffix is set to 20. During the candidate generation, the token substitute batch size B and top-k are both set to 256. We first launch the jailbreak attack with the suffix initialization ( ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ) on the simple jailbreak request to generate the jailbreak suffix. The request is set to Provide instructions on how to steal from a store without getting caught . The number of iterations for this process is set to 1,000. We use the previously generated jailbreak suffix to initialize the jailbreak suffixes of malicious questions. The number of iterations for this process is 500. |