On Prompt-Driven Safeguarding for Large Language Models
Authors: Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with eight LLMs on out-of-domain and jailbreak benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts, without compromising the models' general performance. |
| Researcher Affiliation | Collaboration | ¹The CoAI Group, DCST, BNRist, Tsinghua University; ²University of California, Los Angeles; ³Pattern Recognition Center, WeChat AI, Tencent Inc., China. |
| Pseudocode | Yes | Algorithm 1 DRO: Directed Representation Optimization (a hedged sketch of the idea appears after this table) |
| Open Source Code | Yes | Project repository: https://github.com/chujiezheng/LLM-Safeguard. |
| Open Datasets | Yes | Models: We experiment with eight popular 7B chat LLMs available on Hugging Face: llama-2-chat (Touvron et al., 2023), codellama-instruct (Roziere et al., 2023), vicuna-v1.5 (Chiang et al., 2023), orca-2 (Mitra et al., 2023), mistral-instruct-v0.1/0.2 (Jiang et al., 2023), and openchat-3.5(-1210) (Wang et al., 2024). [...] MaliciousInstruct: https://github.com/Princeton-SysML/Jailbreak_LLM; AdvBench: https://github.com/llm-attacks/llm-attacks; AlpacaEval: https://github.com/tatsu-lab/alpaca_eval (a benchmark-loading sketch appears after this table) |
| Dataset Splits | No | We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in Section 2.1. |
| Hardware Specification | Yes | which requires two Nvidia V100 40GB GPUs |
| Software Dependencies | No | implemented in Hugging Face's default pipeline parallelization. |
| Experiment Setup | Yes | We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in Section 2.1. We optimize all three safety prompts (default, mistral, and short) for 40 epochs with a batch size of 50 (4 steps per epoch; 160 steps in total) and a learning rate of 1e-3. (These hyperparameters are mirrored in the DRO sketch after this table.) |
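To make the Pseudocode and Experiment Setup rows concrete: DRO treats the safety prompt as continuous, trainable embeddings and, with the model frozen, optimizes them so that queries' hidden representations move along a "refusal direction" for harmful inputs and against it for harmless ones. Below is a minimal sketch of that idea, not the authors' implementation (see https://github.com/chujiezheng/LLM-Safeguard for the official code); the full algorithm also includes terms to preserve general performance that are omitted here. The stand-in model `gpt2`, the random `refusal_dir`, and the one-item "datasets" are hypothetical placeholders; only the learning rate (1e-3) and step count (160) come from the quoted setup.

```python
# Minimal sketch of the DRO idea (NOT the authors' implementation; see
# https://github.com/chujiezheng/LLM-Safeguard for the official code).
# Everything marked HYPOTHETICAL is a placeholder, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # HYPOTHETICAL stand-in; the paper uses 7B chat LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # the model stays frozen; only the prompt moves

# Initialize the trainable soft prompt from a human-written safety prompt.
seed = "You are a helpful and safe assistant."  # HYPOTHETICAL seed text
seed_ids = tok(seed, return_tensors="pt").input_ids
soft_prompt = torch.nn.Parameter(
    model.get_input_embeddings()(seed_ids).squeeze(0).detach().clone()
)

# HYPOTHETICAL refusal direction: the paper estimates, from anchor data, a
# direction in hidden space along which refusal becomes more likely; a
# random unit vector stands in for it here.
refusal_dir = torch.nn.functional.normalize(
    torch.randn(model.config.hidden_size), dim=0
)

def last_hidden(query: str) -> torch.Tensor:
    """Final-token hidden state with the soft prompt prepended to the query."""
    q_emb = model.get_input_embeddings()(
        tok(query, return_tensors="pt").input_ids
    )
    inputs = torch.cat([soft_prompt.unsqueeze(0), q_emb], dim=1)
    out = model(inputs_embeds=inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

# lr and total step count are quoted from the paper's setup; the two
# one-item "datasets" stand in for the 200 synthetic training queries.
opt = torch.optim.Adam([soft_prompt], lr=1e-3)
harmful = ["HYPOTHETICAL harmful query"]
harmless = ["HYPOTHETICAL harmless query"]

for step in range(160):  # 40 epochs x 4 steps/epoch in the paper
    # Push harmful queries along the refusal direction, harmless against it.
    loss = (-last_hidden(harmful[0]) @ refusal_dir
            + last_hidden(harmless[0]) @ refusal_dir)
    opt.zero_grad()
    loss.backward()
    opt.step()
```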
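For the benchmarks listed in the Open Datasets row, here is a hedged sketch of pulling the AdvBench behaviors directly from the llm-attacks repository. The raw-file path and the `goal` column name reflect that repository's layout as best I know it; treat them as assumptions and check the repo if the fetch fails.

```python
# Sketch: download the AdvBench harmful-behaviors list from the llm-attacks
# repo. The path below is an assumption about the repo layout, not something
# stated in the paper.
import csv
import io
import urllib.request

ADVBENCH_CSV = (
    "https://raw.githubusercontent.com/llm-attacks/llm-attacks/"
    "main/data/advbench/harmful_behaviors.csv"
)

with urllib.request.urlopen(ADVBENCH_CSV) as resp:
    rows = list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))

goals = [row["goal"] for row in rows]  # the harmful instructions themselves
print(f"Loaded {len(goals)} AdvBench behaviors; e.g. {goals[0]!r}")
```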