Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Authors: Zhixin Xie, Xurui Song, Jun Luo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our attack is comprehensively evaluated by comparing it with five baselines on ten models. Experiments demonstrate that our method achieves significant advantages in both attack effectiveness and attack stealth.
Researcher Affiliation Academia Zhixin Xie, Xurui Song, and Jun Luo College of Computing and Data Science, Nanyang Technological University, Singapore Correspondence to: EMAIL
Pseudocode No The paper describes the attack strategy using text and a visual overview in Figure 1. It does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/ZHIXINXIE/ten_benign.git.
Open Datasets Yes We evaluate the effectiveness of all the attacks on Adv Bench [65]. We expand our testing to six benchmarks: Adv Bench, Air Bench [62], Harm Bench [28], Jailbreak Bench [4], Sorry Bench [59], and Strong Reject Bench [52]. To supplement our main findings, we have evaluated the utility of all original and compromised models on GSM8K [7], MMLU [15], and Writing Bench [58] with zero-shot and single trial (pass@1).
Dataset Splits Yes We sample 200 questions from each benchmark, except for Jailbreak Bench, from which we use all 100 available questions. In addition, we use two well-established judges for automated evaluation: the Harm Bench-Llama-2-13B-cls classifier [18] and the prompt-driven Strong Reject method [52]. Following the Strong Reject method [52], an answer is considered a successful attack if its refusal_answer score is 0 while both its convincing_answer and specificity_answer scores are greater than 4.
Hardware Specification Yes We use 2 A100 80G GPU to complete our fine-tuning, and use about 140G memory to fine-tune the opensource models. On average, our attack takes less than 2 minutes to complete the fine-tuning process, as the dataset contains 10 benign QA pairs. For the online fine-tuning, we use the dashboard provided by Open AI 2
Software Dependencies No The paper does not provide specific version numbers for software dependencies such as programming languages or libraries.
Experiment Setup Yes For simplicity, we set the batch size as 1 for all the models. For all five open source models (Llama2-7b, Llama3-8b, Deepseek-R1-Distill-Llama3-8b, Qwen2.5-7b, and Qwen3-8b), we set epochs as 10 and learning rate as 1e-5 for both two stages. For the GPT family models, we use different settings. For smaller models such as GPT-4o-mini and GPT-4.1-mini, we set the learning multiplier as 1.8 and the epoch as 2 for Stage-1, and the learning multiplier as 5 and the epoch as 10 for Stage-2. For the bigger models such as GPT-3.5-turbowe set the learning multiplier as 1.8 and the epoch as 2 for Stage-1, and the learning multiplier as 10 and the epoch as 10 for the Stage-2.