Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Protecting Your LLMs with Information Bottleneck
Authors: Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts, without overly affecting response quality or inference speed. We evaluate IBProtector on token-level and prompt-level jailbreaking datasets. The results show that IBProtector can successfully defend against adversarial prompts without substantially affecting LLMs' responsiveness and inference consumption. |
| Researcher Affiliation | Collaboration | Zichuan Liu (1,2), Zefan Wang (3), Linjie Xu (2,4), Jinyu Wang (2), Lei Song (2), Tianchun Wang (5), Chunlin Chen (1), Wei Cheng (6), Jiang Bian (2). Affiliations: (1) Nanjing University, (2) Microsoft Research Asia, (3) Tsinghua University, (4) Queen Mary University of London, (5) Pennsylvania State University, (6) NEC Laboratories America |
| Pseudocode | Yes | Appendix C, Algorithm 1: The pseudo-code of IBProtector |
| Open Source Code | Yes | The interested reader can refer to our code for more details: https://github.com/zichuan-liu/IB4LLMs. |
| Open Datasets | Yes | We mainly evaluate our IBProtector on three datasets: AdvBench [2], TriviaQA [36], and EasyJailbreak [37]. EasyJailbreak's jailbreak results (https://github.com/EasyJailbreak/EasyJailbreak/tree/master?tab=readme-ov-file#-experimental-results) as adversarial prompts. |
| Dataset Splits | No | The paper specifies training and test dataset sizes but does not explicitly mention a separate validation set for model training or how it was used if present. |
| Hardware Specification | Yes | For computational resources, all our experiments are performed on a cluster with one NVIDIA 80GB Tesla A100 GPU and 4 NVIDIA Tesla 40GB V100 GPUs, where the cuda version is 12.2. |
| Software Dependencies | Yes | For computational resources, all our experiments are performed on a cluster with one NVIDIA 80GB Tesla A100 GPU and 4 NVIDIA Tesla 40GB V100 GPUs, where the cuda version is 12.2. The template for each model uses FastChat version 0.2.20, which is consistent with GCG (https://github.com/llm-attacks/llm-attacks/blob/main/requirements.txt). |
| Experiment Setup | Yes | Our default hyperparameters of loss weights are set as α = 0.5, λ = 1.0, and r = 0.5. We set the learning rate to be 2e-5 and the epoch to be 3 for training IBProtector and we choose AdamW as our optimizer. For LLM generation, we use greedy decoding with do_sample=False and a top-p value of 1.0 for better reproducibility. |
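
The values quoted in the Experiment Setup row can be gathered into a single configuration object, which may be convenient when attempting a reproduction. The following is only a minimal sketch: the class name `TrainingSetup` and all field names are hypothetical and are not taken from the authors' repository, which may organize these settings differently. Only the numeric values and the `do_sample` / top-p settings come from the row above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    alpha: float = 0.5            # α, as quoted
    lam: float = 1.0              # λ, as quoted ("lambda" is a Python keyword)
    r: float = 0.5                # r, listed alongside the loss weights in the source
    learning_rate: float = 2e-5   # quoted learning rate
    epochs: int = 3               # quoted number of training epochs
    optimizer: str = "AdamW"      # quoted optimizer choice (stored as a label only)


# Greedy decoding settings quoted in the same row.
GENERATION_KWARGS = {"do_sample": False, "top_p": 1.0}

setup = TrainingSetup()
print(asdict(setup))
print(GENERATION_KWARGS)
```

Freezing the dataclass prevents the reported defaults from being changed by accident during a run. The keys in `GENERATION_KWARGS` use the argument names found in Hugging Face `generate` calls; whether the authors pass them in exactly this form is an assumption.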