Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism
Authors: Beitao Chen, Xinyu Lyu, Shengming Yuan, Jingkuan Song, Hengtao Shen, Lianli Gao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility. Our code is available at https://github.com/BT-C/SafePTR. |
| Researcher Affiliation | Academia | (1) Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; (2) Southwestern University of Finance and Economics, Chengdu, China; (3) Engineering Research Center of Intelligent Finance, Ministry of Education; (4) Tongji University |
| Pseudocode | No | The paper describes the SafePTR framework in Section 3 and illustrates its components through textual explanations and a block diagram (Fig. 5). However, it does not include a dedicated pseudocode or algorithm block with structured, step-by-step procedures. |
| Open Source Code | Yes | Our code is available at https://github.com/BT-C/SafePTR. |
| Open Datasets | Yes | We use JailbreakV-28K [Luo et al., 2024b] (text-driven), MM-SafetyBench [Liu et al., 2023b], and FigStep [Gong et al., 2025] (image-driven)... Benign task accuracy is measured on MME [Fu et al., 2023] and MM-Vet [Yu et al., 2024]... |
| Dataset Splits | No | The paper mentions using datasets like FigStep (500) and MM-SafetyBench (5040) and refers to a 'unified test set'. However, it does not explicitly describe how the datasets are split into training, validation, or test sets, such as percentages, absolute sample counts for each split, or references to predefined splits. |
| Hardware Specification | Yes | All experiments are conducted on four RTX3090 GPUs. |
| Software Dependencies | No | Following Immune [Ghosal et al., 2024], we implement the proposed SafePTR using Hugging Face Transformers library. The LLaVA1.5-7B results are based on version 1.2.2 from the official benchmark repository. The paper mentions 'Hugging Face Transformers library' but does not specify a version number for this library. 'LLaVA1.5-7B version 1.2.2' refers to a model version, not a software dependency version. |
| Experiment Setup | Yes | We set the number of tokens sampled k = 10%. For LLaVA-1.5-7B, DeepSeek-VL2, and MiniGPT-4-7B, harmful tokens are pruned in layers [7, 9), [4, 6), [7, 9). |
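
The layer ranges in the Experiment Setup row are half-open intervals, so [7, 9) covers two layers. A minimal sketch of how these reported settings could be written down as a configuration follows. The names `PRUNE_LAYERS`, `SAMPLE_PERCENT`, `pruned_layer_indices`, and `num_sampled_tokens` are hypothetical and are not taken from the SafePTR repository. The layer indexing base and the rounding rule are assumptions, since the summary above does not state them.

```python
# Hypothetical configuration sketch; not the authors' code.
# Half-open layer ranges [start, end) in which harmful tokens are pruned,
# as reported in the Experiment Setup row.
PRUNE_LAYERS = {
    "LLaVA-1.5-7B": (7, 9),
    "DeepSeek-VL2": (4, 6),
    "MiniGPT-4-7B": (7, 9),
}

# Reported sampling setting: k = 10% of tokens.
SAMPLE_PERCENT = 10


def pruned_layer_indices(model: str) -> list[int]:
    """Expand the half-open range [start, end) into explicit layer indices.

    Whether the paper counts layers from 0 or 1 is not stated here;
    the numbers are passed through unchanged.
    """
    start, end = PRUNE_LAYERS[model]
    return list(range(start, end))


def num_sampled_tokens(num_tokens: int, percent: int = SAMPLE_PERCENT) -> int:
    """Number of tokens corresponding to k percent of a sequence.

    Assumption: round down, but always keep at least one token.
    """
    return max(1, num_tokens * percent // 100)


if __name__ == "__main__":
    for name in PRUNE_LAYERS:
        print(name, pruned_layer_indices(name))
    print(num_sampled_tokens(576))
```

Under these assumptions, each model has exactly two pruned layers, and a 576-token input would correspond to 57 sampled tokens.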