On Prompt-Driven Safeguarding for Large Language Models

Authors: Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with eight LLMs on out-of-domain and jailbreak benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts, without compromising the models' general performance.
Researcher Affiliation | Collaboration | ¹The CoAI Group, DCST, BNRist, Tsinghua University; ²University of California, Los Angeles; ³Pattern Recognition Center, WeChat AI, Tencent Inc., China.
Pseudocode | Yes | Algorithm 1 DRO: Directed Representation Optimization. (A hedged sketch of the objective follows the table.)
Open Source Code | Yes | Project repository: https://github.com/chujiezheng/LLM-Safeguard
Open Datasets | Yes | Models: We experiment with eight popular 7B chat LLMs available on Hugging Face: llama-2-chat (Touvron et al., 2023), codellama-instruct (Roziere et al., 2023), vicuna-v1.5 (Chiang et al., 2023), orca-2 (Mitra et al., 2023), mistral-instruct-v0.1/0.2 (Jiang et al., 2023), and openchat-3.5(-1210) (Wang et al., 2024). [...] MaliciousInstruct: https://github.com/Princeton-SysML/Jailbreak_LLM; AdvBench: https://github.com/llm-attacks/llm-attacks; AlpacaEval: https://github.com/tatsu-lab/alpaca_eval
Dataset Splits | No | We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in Section 2.1.
Hardware Specification | Yes | which requires two Nvidia V100 40GB GPUs
Software Dependencies | No | implemented in the default Hugging Face's pipeline parallelization. (A hedged loading example follows the table.)
Experiment Setup | Yes | We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in Section 2.1. We optimize all three safety prompts (default, mistral, and short) for 40 epochs with a batch size of 50 (4 steps per epoch; 160 steps in total) and a learning rate of 1e-3. (The training-loop sketch after the table reproduces this step arithmetic.)
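
Per the paper's abstract, Algorithm 1 (DRO) treats the safety prompt as continuous, trainable embeddings and moves queries' representations along or opposite the model's refusal direction, depending on their harmfulness. Below is a minimal sketch of such an objective, not the authors' implementation: the precomputed unit `refusal_dir`, the weight `alpha`, and the squared-distance anchor keeping the optimized prompt near the human-crafted one are simplifying assumptions.

```python
import torch

def dro_style_loss(hidden, prompt_emb, init_prompt_emb, refusal_dir, is_harmful, alpha=1.0):
    """Illustrative DRO-style objective (not the authors' code).

    hidden:          query representation under the current safety prompt, shape (d,)
    prompt_emb:      trainable continuous embeddings of the safety prompt
    init_prompt_emb: embeddings of the human-crafted safety prompt (anchor)
    refusal_dir:     assumed precomputed unit vector of the refusal direction, shape (d,)
    """
    proj = hidden @ refusal_dir                  # position along the refusal direction
    sign = 1.0 if is_harmful else -1.0           # harmful: push up; harmless: push down
    directional = -sign * proj                   # minimizing moves proj the right way
    anchor = (prompt_emb - init_prompt_emb).pow(2).mean()  # stay near the original prompt
    return directional + alpha * anchor
```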
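All eight models are standard Hugging Face chat checkpoints, and the dependencies row mentions the default Hugging Face pipeline parallelization. A hedged loading example: the hub id `meta-llama/Llama-2-7b-chat-hf` is an assumption (the report does not give exact repo ids), and `device_map="auto"` is the stock transformers/accelerate mechanism for splitting a model across the two listed GPUs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed hub id for the listed llama-2-chat 7B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # lets accelerate shard layers across available GPUs
)
```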
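The reported setup pins down the step arithmetic: 200 samples at batch size 50 gives 4 steps per epoch, and 40 epochs gives 160 steps. A minimal training-loop sketch under those numbers, reusing `dro_style_loss` from above; the Adam optimizer, prompt length 20, and hidden size 4096 are assumptions, and the random `hidden` batch stands in for the frozen LLM's representations (in the real setup those depend on `prompt_emb` through the model, which is how gradients reach the prompt).

```python
import torch

# Reported configuration: 200 synthetic samples, batch size 50, 40 epochs, lr 1e-3.
N_SAMPLES, BATCH_SIZE, EPOCHS, LR = 200, 50, 40, 1e-3
steps_per_epoch = N_SAMPLES // BATCH_SIZE       # 200 / 50 = 4 steps per epoch
total_steps = EPOCHS * steps_per_epoch          # 40 * 4 = 160 steps in total

d = 4096                                        # assumed hidden size of a 7B model
prompt_emb = torch.nn.Parameter(0.02 * torch.randn(20, d))  # prompt length 20 is illustrative
init_prompt_emb = prompt_emb.detach().clone()               # anchor: the human-crafted prompt
refusal_dir = torch.nn.functional.normalize(torch.randn(d), dim=0)  # placeholder direction

optimizer = torch.optim.Adam([prompt_emb], lr=LR)  # optimizer choice is an assumption
for _ in range(total_steps):
    # Placeholder batch of (representation, harmfulness-label) pairs.
    batch = [(torch.randn(d), bool(torch.rand(1) < 0.5)) for _ in range(BATCH_SIZE)]
    loss = torch.stack([
        dro_style_loss(h, prompt_emb, init_prompt_emb, refusal_dir, harmful)
        for h, harmful in batch
    ]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```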