Fight Back Against Jailbreaking via Prompt Adversarial Tuning

Authors: Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments show that our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0%, while maintaining the model's utility on the benign task and incurring only negligible computational overhead, charting a new perspective for future explorations in LLM security.
Researcher Affiliation | Academia | 1) State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2) Shanghai Jiao Tong University; 3) School of Mathematical Sciences, Peking University; 4) Institute for Artificial Intelligence, Peking University
Pseudocode | Yes | Algorithm 1: Prompt Adversarial Tuning (PAT)
Open Source Code | Yes | Our code is available at https://github.com/PKU-ML/PAT.
Open Datasets | Yes | We performed experiments on the Advbench dataset [70], which is one of the most prevailing benchmark datasets to measure the security of LLMs.
Dataset Splits | No | The paper describes using 'Three sets of dialogue data... including harmful prompts and targets..., harmful prompts and safety targets..., benign prompts and goals' and specifies acquiring 25 harmful prompts and 100 benign prompts. However, it does not provide explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined splits) in the traditional sense.
Hardware Specification | Yes | All the experiments are performed on one or multiple NVIDIA A100 80G GPUs.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries with their respective versions) used in the experiments.
Experiment Setup | Yes | The hyperparameter settings for PAT during our tuning process are as follows: the number of prompts m for control optimization is 25. As for the control length, the length of the attack control is 20, and the length of the defense control is 15. We iteratively update the controls for 100 epochs. During token selection, the token set size k is chosen as 256 and the batch size B is 512.
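The reported hyperparameters and the alternating attack/defense structure suggested by Algorithm 1 can be collected into a short sketch. This is a minimal illustrative sketch, not the paper's implementation: the toy `toy_loss` function stands in for the LLM log-probability objectives, the function and field names are assumptions, and only the numeric values (m=25, lengths 20/15, 100 epochs, k=256, B=512) come from the paper.

```python
import random
from dataclasses import dataclass

# Hyperparameter values reported for PAT tuning; the dataclass
# layout and field names are illustrative assumptions.
@dataclass(frozen=True)
class PATConfig:
    num_prompts: int = 25       # m: prompts used for control optimization
    attack_ctrl_len: int = 20   # length of the attack control (tokens)
    defense_ctrl_len: int = 15  # length of the defense control (tokens)
    epochs: int = 100           # alternating update iterations
    top_k: int = 256            # k: candidate token set size
    batch_size: int = 512       # B: candidate substitutions scored per step

VOCAB = list(range(1000))  # toy vocabulary standing in for the tokenizer

def toy_loss(ctrl, other):
    # Placeholder objective; a real run would score model
    # log-probabilities of target completions over the m prompts.
    return (sum(ctrl) * 31 + sum(other)) % 9973

def greedy_token_update(ctrl, other, cfg):
    """GCG-style step: try random single-token swaps drawn from a
    top-k candidate set and keep the lowest-loss control."""
    best, best_loss = list(ctrl), toy_loss(ctrl, other)
    for _ in range(cfg.batch_size):
        cand = list(ctrl)
        cand[random.randrange(len(cand))] = random.choice(VOCAB[:cfg.top_k])
        loss = toy_loss(cand, other)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best

def prompt_adversarial_tuning(cfg=PATConfig(), epochs=None):
    random.seed(0)
    attack = [random.choice(VOCAB) for _ in range(cfg.attack_ctrl_len)]
    defense = [random.choice(VOCAB) for _ in range(cfg.defense_ctrl_len)]
    for _ in range(epochs or cfg.epochs):
        attack = greedy_token_update(attack, defense, cfg)   # attacker step
        defense = greedy_token_update(defense, attack, cfg)  # defender step
    return defense
```

Under these assumptions, the returned 15-token defense control is the artifact PAT keeps: it is attached to user prompts at inference time, which is why the method adds only negligible computational overhead.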