reproducibilityindex.ai

On Prompt-Driven Safeguarding for Large Language Models

Authors: Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments with eight LLMs on out-of-domain and jailbreak benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts, without compromising the models' general performance.
Researcher Affiliation	Collaboration	1 The CoAI Group, DCST, BNRist, Tsinghua University; 2 University of California, Los Angeles; 3 Pattern Recognition Center, WeChat AI, Tencent Inc., China.
Pseudocode	Yes	Algorithm 1 DRO: Directed Representation Optimization
Open Source Code	Yes	Project repository: https://github.com/chujiezheng/LLM-Safeguard.
Open Datasets	Yes	Models: We experiment with eight popular 7B chat LLMs available on Hugging Face: llama-2-chat (Touvron et al., 2023), codellama-instruct (Roziere et al., 2023), vicuna-v1.5 (Chiang et al., 2023), orca-2 (Mitra et al., 2023), mistral-instruct-v0.1/0.2 (Jiang et al., 2023), and openchat-3.5(-1210) (Wang et al., 2024). [...] MaliciousInstruct: https://github.com/Princeton-SysML/Jailbreak_LLM; AdvBench: https://github.com/llm-attacks/llm-attacks; AlpacaEval: https://github.com/tatsu-lab/alpaca_eval
Dataset Splits	No	We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in §2.1.
Hardware Specification	Yes	which requires two Nvidia V100 40GB GPUs
Software Dependencies	No	implemented in the default Hugging Face's pipeline parallelization.
Experiment Setup	Yes	We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in §2.1. We optimize all three safety prompts (default, mistral, and short) for 40 epochs with a batch size of 50 (4 steps per epoch; 160 steps in total) and a learning rate of 1e-3.
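
The step counts quoted in the Experiment Setup row follow directly from the dataset size, batch size, and epoch count. The short Python sketch below only re-derives that arithmetic; the variable names are illustrative and are not taken from the paper or its repository.

```python
import math

# Values quoted in the Experiment Setup row.
num_examples = 200  # synthetic training examples
batch_size = 50
num_epochs = 40

# One optimizer step per batch (assumes no gradient accumulation).
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
```

With these values the sketch yields 4 steps per epoch and 160 steps in total, matching the figures in the quoted setup.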