Protecting Your LLMs with Information Bottleneck

Authors: Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts, without overly affecting response quality or inference speed. We evaluate IBProtector on token-level and prompt-level jailbreaking datasets. The results show that IBProtector can successfully defend against adversarial prompts without substantially affecting LLMs' responsiveness and inference consumption.
Researcher Affiliation | Collaboration | Zichuan Liu (1,2), Zefan Wang (3), Linjie Xu (2,4), Jinyu Wang (2), Lei Song (2), Tianchun Wang (5), Chunlin Chen (1), Wei Cheng (6), Jiang Bian (2); 1 Nanjing University, 2 Microsoft Research Asia, 3 Tsinghua University, 4 Queen Mary University of London, 5 Pennsylvania State University, 6 NEC Laboratories America
Pseudocode | Yes | Appendix C, Algorithm 1: the pseudo-code of IBProtector
Open Source Code | Yes | The interested reader can refer to our code for more details: https://github.com/zichuan-liu/IB4LLMs.
Open Datasets | Yes | We mainly evaluate our IBProtector on three datasets: AdvBench [2], TriviaQA [36], and EasyJailbreak [37]. EasyJailbreak's jailbreak results (https://github.com/EasyJailbreak/EasyJailbreak/tree/master?tab=readme-ov-file#-experimental-results) are used as adversarial prompts.
Dataset Splits | No | The paper specifies training and test dataset sizes but does not explicitly mention a separate validation set for model training, or how it was used if present.
Hardware Specification | Yes | For computational resources, all our experiments are performed on a cluster with one NVIDIA 80GB Tesla A100 GPU and 4 NVIDIA Tesla 40GB V100 GPUs, where the CUDA version is 12.2.
Software Dependencies | Yes | The CUDA version is 12.2. The template for each model uses FastChat version 0.2.20, which is consistent with GCG (https://github.com/llm-attacks/llm-attacks/blob/main/requirements.txt).
Experiment Setup | Yes | Our default hyperparameters of loss weights are set as α = 0.5, λ = 1.0, and r = 0.5. We set the learning rate to 2e-5 and train IBProtector for 3 epochs, with AdamW as the optimizer. For LLM generation, we use greedy decoding with do_sample=False and a top-p value of 1.0 for better reproducibility.
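
The optimizer named in the experiment setup can be illustrated with a minimal single-parameter sketch of one AdamW update step (decoupled weight decay, per Loshchilov & Hutter), using the paper's reported learning rate of 2e-5. The betas, epsilon, and weight-decay values here are common defaults chosen for illustration, not values taken from the paper.

```python
import math

def adamw_step(w, g, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    AdamW decouples weight decay from the gradient-based update:
    the decay term lr * weight_decay * w is applied directly to the
    weight instead of being folded into the moment estimates.
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# One step from w=0 with gradient g=1.0: on the first step
# m_hat / sqrt(v_hat) is close to 1, so w moves by roughly -lr.
w, m, v = adamw_step(w=0.0, g=1.0, m=0.0, v=0.0, t=1)
```

In practice one would use `torch.optim.AdamW` directly; the sketch only makes the decoupled-decay update rule explicit.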