RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

Authors: Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental evaluations demonstrate that RigorLLM not only outperforms existing baselines like OpenAI API and Perspective API in detecting harmful content but also exhibits unparalleled resilience to jailbreaking attacks.
Researcher Affiliation Collaboration University of Illinois Urbana-Champaign, Virginia Tech, Salesforce Research, University of California Berkeley, University of Chicago.
Pseudocode Yes Algorithm 1 Energy-based data generation.
Open Source Code Yes Our code is available at https://github.com/eurekayuan/RigorLLM.
Open Datasets Yes The training data of RigorLLM consists of harmful instructions from HEx-PHI (Qi et al., 2024), benign instructions from HotpotQA (Yang et al., 2018) and MT-bench (Zheng et al., 2023), and the validation data from OpenAI Moderation Dataset (Markov et al., 2023) and ToxicChat (Lin et al., 2023). ... All the datasets are publicly available.
Dataset Splits Yes The training data of RigorLLM consists of harmful instructions from HEx-PHI (Qi et al., 2024), benign instructions from HotpotQA (Yang et al., 2018) and MT-bench (Zheng et al., 2023), and the validation data from OpenAI Moderation Dataset (Markov et al., 2023) and ToxicChat (Lin et al., 2023). We use all 330 harmful instructions of HEx-PHI, which belong to 11 prohibited categories. Besides, we include 1,000 queries from HotpotQA and 80 queries from MT-bench for the benign category. OpenAI Moderation Dataset consists of 1,680 prompt examples sampled from public data and annotated according to its own taxonomy. We randomly sampled 129 queries as validation data (15 instances from each category) for energy-based data generation. The remaining 1,551 prompts are used for evaluation, of which 522 were labeled as harmful. For ToxicChat, we use the first 1,000 records from its testing dataset, consisting of 223 toxic prompts and 777 benign prompts. We use the first 1,000 records from its training data as validation data.
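A minimal sketch of how the splits described above could be assembled. This is not the authors' code: the file names and the JSONL "prompt" field are assumptions, and the validation sample is drawn uniformly at random rather than with the paper's 15-per-category stratification.

```python
import json
import random

def load_prompts(path):
    # Assumed format: one JSON object per line with a "prompt" field.
    with open(path) as f:
        return [json.loads(line)["prompt"] for line in f]

random.seed(0)

# Training data: 330 HEx-PHI harmful instructions plus benign queries
# from HotpotQA (1,000) and MT-bench (80).
train_harmful = load_prompts("hex_phi.jsonl")
train_benign = load_prompts("hotpotqa_1k.jsonl") + load_prompts("mtbench_80.jsonl")

# OpenAI Moderation Dataset: 129 prompts for validation (the paper samples
# 15 per category), the remaining 1,551 for evaluation.
moderation = load_prompts("openai_moderation.jsonl")  # 1,680 prompts
val_moderation = random.sample(moderation, 129)
val_set = set(val_moderation)
eval_moderation = [p for p in moderation if p not in val_set]

# ToxicChat: first 1,000 test records for evaluation, first 1,000 training
# records for validation.
toxicchat_eval = load_prompts("toxicchat_test.jsonl")[:1000]
toxicchat_val = load_prompts("toxicchat_train.jsonl")[:1000]
```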
Hardware Specification Yes All our experiments are conducted on a single NVIDIA A6000 Ada GPU.
Software Dependencies No The paper mentions specific models (Llama2-7B, Vicuna-7B, Llama Guard) and algorithms (GCG) but does not provide version numbers for general software dependencies like programming languages or libraries (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes For resilient optimization, we alternately fix the safe suffix or the adversarial suffix and optimize the other with the GCG algorithm (Zou et al., 2023) on Vicuna-7B (Zheng et al., 2023). We use the default parameters of GCG. For k in probabilistic KNN and the weight α in prediction aggregation, we perform grid search to select the values that achieve the best performance. For the text encoder, we use Llama Guard. Specifically, we extract the hidden states of the last non-padding token predicted by Llama Guard as its embedding.
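A minimal sketch of the embedding and aggregation steps quoted above, not the authors' implementation: it extracts the hidden state of the last non-padding token from a Llama Guard checkpoint, scores a query with a probabilistic KNN over labeled reference prompts, and blends that score with a direct guard-model score via a weight α. The model ID, the function names, and the default k and α values are assumptions (the paper selects k and α by grid search).

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/LlamaGuard-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # last non-padding token then sits at attention_mask.sum() - 1
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")
model.eval()

@torch.no_grad()
def embed(texts):
    """Return one embedding per text: hidden state of the last non-padding token."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)
    out = model(**batch, output_hidden_states=True)
    last_hidden = out.hidden_states[-1]                # (batch, seq_len, hidden)
    last_idx = batch["attention_mask"].sum(dim=1) - 1  # position of last real token
    return last_hidden[torch.arange(last_hidden.size(0)), last_idx].float()

def knn_harmful_probability(query_emb, ref_embs, ref_labels, k=8):
    """Probabilistic KNN: fraction of the k nearest reference prompts labeled harmful (1)."""
    dists = torch.cdist(query_emb.unsqueeze(0), ref_embs).squeeze(0)
    nearest = dists.topk(k, largest=False).indices
    return ref_labels[nearest].float().mean().item()

def aggregate(knn_score, guard_score, alpha=0.5):
    """Blend the KNN probability with a direct guard-model score using weight alpha."""
    return alpha * knn_score + (1 - alpha) * guard_score
```

In this sketch, `ref_embs` and `ref_labels` would hold the embeddings and harmful/benign labels of the training prompts, and `guard_score` stands in for whatever direct harmfulness probability is being aggregated with the KNN vote.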