Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
On Prompt-Driven Safeguarding for Large Language Models
Authors: Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with eight LLMs on out-of-domain and jailbreak benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts, without compromising the models' general performance. |
| Researcher Affiliation | Collaboration | (1) The CoAI Group, DCST, BNRist, Tsinghua University; (2) University of California, Los Angeles; (3) Pattern Recognition Center, WeChat AI, Tencent Inc., China. |
| Pseudocode | Yes | Algorithm 1 DRO: Directed Representation Optimization |
| Open Source Code | Yes | Project repository: https://github.com/chujiezheng/LLM-Safeguard. |
| Open Datasets | Yes | Models: We experiment with eight popular 7B chat LLMs available on Hugging Face: llama-2-chat (Touvron et al., 2023), codellama-instruct (Roziere et al., 2023), vicuna-v1.5 (Chiang et al., 2023), orca-2 (Mitra et al., 2023), mistral-instruct-v0.1/0.2 (Jiang et al., 2023), and openchat-3.5(-1210) (Wang et al., 2024). [...] MaliciousInstruct: https://github.com/Princeton-SysML/Jailbreak_LLM; AdvBench: https://github.com/llm-attacks/llm-attacks; AlpacaEval: https://github.com/tatsu-lab/alpaca_eval |
| Dataset Splits | No | We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in §2.1. |
| Hardware Specification | Yes | which requires two Nvidia V100 40GB GPUs |
| Software Dependencies | No | implemented in the default Hugging Face's pipeline parallelization. |
| Experiment Setup | Yes | We train DRO and vanilla Prompt-Tuning both on the 200 synthetic data in §2.1. We optimize all three safety prompts (default, mistral, and short) for 40 epochs with a batch size of 50 (4 steps per epoch; 160 steps in total) and a learning rate of 1e-3 |
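
The step counts quoted in the Experiment Setup row follow directly from the dataset size, batch size, and epoch count. The sketch below reproduces that arithmetic as a sanity check. The function name `optimizer_steps` is hypothetical, and the code assumes one optimizer step per batch with no gradient accumulation, which the quoted text does not state explicitly.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps).

    Assumes one optimizer step per batch and that a final partial
    batch, if any, still counts as a step (hence the ceiling).
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Values quoted in the table: 200 synthetic examples, batch size 50, 40 epochs.
steps_per_epoch, total_steps = optimizer_steps(200, 50, 40)
print(steps_per_epoch, total_steps)  # 4 160
```

With the reported values, the result matches the "4 steps per epoch; 160 steps in total" stated in the row.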