Protecting Your LLMs with Information Bottleneck
Authors: Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts, without overly affecting response quality or inference speed. We evaluate IBProtector on token-level and prompt-level jailbreaking datasets. The results show that IBProtector can successfully defend against adversarial prompts without substantially affecting LLMs' responsiveness and inference consumption. |
| Researcher Affiliation | Collaboration | Zichuan Liu (1,2), Zefan Wang (3), Linjie Xu (2,4), Jinyu Wang (2), Lei Song (2), Tianchun Wang (5), Chunlin Chen (1), Wei Cheng (6), Jiang Bian (2); 1: Nanjing University, 2: Microsoft Research Asia, 3: Tsinghua University, 4: Queen Mary University of London, 5: Pennsylvania State University, 6: NEC Laboratories America |
| Pseudocode | Yes | Appendix C, Algorithm 1: The pseudo-code of IBProtector |
| Open Source Code | Yes | The interested reader can refer to our code for more details: https://github.com/zichuan-liu/IB4LLMs. |
| Open Datasets | Yes | We mainly evaluate our IBProtector on three datasets: AdvBench [2], TriviaQA [36], and EasyJailbreak [37]. We use EasyJailbreak's published jailbreak results (https://github.com/EasyJailbreak/EasyJailbreak/tree/master?tab=readme-ov-file#-experimental-results) as adversarial prompts. |
| Dataset Splits | No | The paper specifies training and test dataset sizes but does not explicitly mention a separate validation set for model training, or how one was used if present. |
| Hardware Specification | Yes | For computational resources, all our experiments are performed on a cluster with one NVIDIA Tesla A100 GPU (80 GB) and four NVIDIA Tesla V100 GPUs (40 GB), with CUDA version 12.2. |
| Software Dependencies | Yes | The CUDA version is 12.2. The template for each model uses FastChat version 0.2.20, which is consistent with GCG (https://github.com/llm-attacks/llm-attacks/blob/main/requirements.txt). (See the FastChat template sketch below the table.) |
| Experiment Setup | Yes | Our default hyperparameters for the loss weights are set as α = 0.5, λ = 1.0, and r = 0.5. We set the learning rate to 2e-5 and train IBProtector for 3 epochs, using AdamW as our optimizer. For LLM generation, we use greedy decoding with do_sample=False and a top-p value of 1.0 for better reproducibility. (See the settings sketch below the table.) |
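As a minimal sketch of how the FastChat conversation templates referenced above are typically applied to format prompts (the template name `vicuna_v1.1` and the user message are illustrative assumptions, not taken from the paper):

```python
# Minimal sketch: formatting a prompt with a FastChat conversation template,
# consistent with how the GCG codebase builds chat prompts. The template
# name "vicuna_v1.1" and the example message are assumptions.
from fastchat.conversation import get_conv_template

conv = get_conv_template("vicuna_v1.1")
conv.append_message(conv.roles[0], "Summarize the information bottleneck principle.")
conv.append_message(conv.roles[1], None)  # leave the assistant turn open
prompt = conv.get_prompt()  # chat-formatted string fed to the target LLM
print(prompt)
```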
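The reported optimizer and decoding settings can likewise be summarized in a hedged sketch; only the values taken from the paper (learning rate 2e-5, 3 epochs, AdamW, greedy decoding with top_p = 1.0) are fixed, while the model choice and the elided training loop are hypothetical placeholders:

```python
# Sketch of the reported training and decoding settings. The model name is
# an assumption; the IBProtector training loop itself is elided.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # assumption: any HF causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Optimizer settings reported in the paper: AdamW with lr = 2e-5, 3 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
num_epochs = 3
# ... IBProtector training loop over num_epochs would go here ...

# Generation settings reported in the paper: greedy decoding, top_p = 1.0.
inputs = tokenizer("How do I stay safe online?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,  # greedy decoding for reproducibility
    top_p=1.0,
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```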