Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MetaDefense: Defending Fine-tuning based Jailbreak Attack Before and During Generation
Authors: Weisen Jiang, Sinno Jialin Pan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. |
| Researcher Affiliation | Academia | Weisen Jiang Sinno Jialin Pan Department of Computer Science and Engineering Chinese University of Hong Kong Hong Kong EMAIL EMAIL |
| Pseudocode | Yes | Algorithm 1 shows the inference procedure of our MetaDefense. [...] Algorithm 2 shows the training procedure of MetaDefense. |
| Open Source Code | Yes | Code is available at https://github.com/ws-jiang/MetaDefense. |
| Open Datasets | Yes | Following [12, 13], at the alignment stage, we sample 2500 harmful queries with harmful responses and 2500 harmful queries with refusal responses from [43] to construct DHF-HF and DHF-HL, respectively. We sample 5000 harmless queries with responses from Alpaca [46] to construct DHL. The harmful queries used in finetuning or attacking are disjoint from those at the alignment stage. At the finetuning stage, following [13], we consider three benign tasks: SST2 (binary classification task) [45], AGNews (multiple choice task) [55], and GSM8K (open-ended generation tasks) [4]. |
| Dataset Splits | No | The paper describes the composition of the datasets used for alignment and finetuning (e.g., 2500 harmful queries, 5000 harmless queries, mixing p percentage of harmful samples with benign samples for finetuning). However, it does not explicitly provide train/test/validation splits (e.g., specific percentages or sample counts for splitting a dataset into training, validation, and test sets) for the datasets used in the experiments (SST2, AGNews, GSM8K), instead implying the use of 'testing data' without detailing the split methodology. |
| Hardware Specification | Yes | All experiments are run on NVIDIA L40S 40G. |
| Software Dependencies | No | The paper mentions software components like 'LoRA [11]' and 'AdamW optimizer [30]' but does not provide specific version numbers for these or any other software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | Following [13, 12], we adopt LoRA [11] for LLM training with rank and alpha set to 32 and 4, respectively. For alignment training, we use AdamW optimizer [30] with a learning rate of 5e-4 and a weight decay factor of 0.1. For alignment, we train 20, 5, and 3 epochs on the alignment dataset for LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct, respectively. For finetuning, we train the aligned LLM for 20 epochs on the benign task data with harmful samples. We use a mini-batch size of 10 for both the alignment and finetuning stage. |
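
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The snippet below is a minimal sketch using only the Python standard library, not the authors' code. The names `ReportedSetup`, `steps_per_epoch`, and `ALIGNMENT_SAMPLES` are hypothetical. The step counts assume that the three alignment subsets from the Open Datasets row (2500 + 2500 + 5000 samples) are combined into one dataset and that each mini-batch corresponds to one optimizer step, neither of which the quoted text confirms. The quoted learning rate and weight decay are stated for alignment training only; the finetuning values are not given in the row above.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (the container itself is hypothetical)."""

    lora_rank: int = 32
    lora_alpha: int = 4
    learning_rate: float = 5e-4  # reported for alignment training
    weight_decay: float = 0.1    # reported for alignment training
    batch_size: int = 10         # reported for both alignment and finetuning
    finetune_epochs: int = 20
    alignment_epochs: dict = field(default_factory=lambda: {
        "LLaMA-2-7B": 20,
        "Qwen-2.5-3B-Instruct": 5,
        "LLaMA-3.2-3B-Instruct": 3,
    })

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Mini-batches per epoch, keeping a final partial batch (an assumption)."""
    return math.ceil(num_samples / batch_size)


setup = ReportedSetup()

# Assumption: the three alignment subsets are merged into one dataset.
ALIGNMENT_SAMPLES = 2500 + 2500 + 5000

for model, epochs in setup.alignment_epochs.items():
    total_steps = epochs * steps_per_epoch(ALIGNMENT_SAMPLES, setup.batch_size)
    print(f"{model}: {total_steps} alignment steps")

print(f"LoRA scaling factor: {setup.lora_scaling}")
```

Keeping the reported values in one frozen object makes it easier to spot what a replication still has to choose independently, such as library versions, random seeds, and the finetuning learning rate.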