Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Authors: Jiaxin Song, Yixu Wang, Jie Li, Xuan Tong, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy, achieves 94.32% white-box and 67.28% black-box attack success averagely, which are 6.17% and 21.13% higher than SOTA methods, respectively. Our findings expose a overlooked safety risk in VLMs and highlight the urgent need for more robust defenses.
Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University; 2 Shanghai Artificial Intelligence Laboratory; 3 Fudan University; 4 NSFOCUS
Pseudocode | Yes | Algorithm 1: Safety Boundary Probing; Algorithm 2: Safety Boundary Crossing
Open Source Code | No | While the code and data are not yet released, we commit to making both publicly available upon publication. The supplemental material outlines the structure of the planned release, including attack implementation, model configurations, and evaluation scripts to ensure reproducibility.
Open Datasets | Yes | We leveraged the MM-SafetyBench dataset Liu et al. [2024c], a meticulously curated multimodal safety evaluation benchmark.
Dataset Splits | No | The paper mentions "1,719 adversarial examples across diverse risk scenarios" as the content of the MM-SafetyBench dataset, but it does not specify how these examples were split into training, validation, or test sets for the experiments conducted in this paper.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies such as libraries or frameworks used for implementation, beyond general mentions in the NeurIPS checklist about settings being in the appendix.
Experiment Setup | Yes | Implementation Details. We set the safety threshold P_0 to 0.3 for determining the decision boundary in classifier space. The multi-objective loss is weighted with λ_1 = 2.0 and λ_2 = 1.0. We use different learning rates: η_v = 0.001 for visual updates and η_t = 0.0005 for textual updates, with fixed suffix length L_suffix = 20 tokens. Visual perturbations are constrained by maximum L_∞ norm of ϵ_v^input = 8/255 to ensure imperceptibility. The optimization process runs for 100-150 iterations to ensure convergence.
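
Since the code is not yet released, the hyperparameters quoted in the Experiment Setup row can be collected in one place for anyone attempting a reimplementation. The sketch below is illustrative only and is not the authors' code: the class name, field names, and the `project_linf` helper are hypothetical, pixel values are assumed to lie in [0, 1], and the iteration cap simply takes the upper end of the reported range. It shows what an 8/255 bound on the visual perturbation means in practice: every pixel of the perturbed image stays within 8/255 of the original.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    safety_threshold: float = 0.3   # P_0, decision-boundary threshold
    lambda_1: float = 2.0           # first loss weight
    lambda_2: float = 1.0           # second loss weight
    lr_visual: float = 0.001        # eta_v, step size for visual updates
    lr_textual: float = 0.0005      # eta_t, step size for textual updates
    suffix_length: int = 20         # L_suffix, in tokens
    epsilon: float = 8 / 255        # max-norm budget for the visual perturbation
    max_iterations: int = 150       # assumption: upper end of the reported range


def project_linf(perturbed, original, epsilon):
    """Clip each pixel to within epsilon of the original, then to [0, 1].

    Both inputs are flat lists of floats; the [0, 1] pixel range is an
    assumption, not something the report states.
    """
    projected = []
    for p, o in zip(perturbed, original):
        bounded = min(max(p, o - epsilon), o + epsilon)
        projected.append(min(max(bounded, 0.0), 1.0))
    return projected


if __name__ == "__main__":
    config = AttackConfig()
    original = [0.5, 0.5, 0.5]
    perturbed = [0.9, 0.1, 0.5]
    print(project_linf(perturbed, original, config.epsilon))
```

A projection of this kind is typically applied after each visual update so that the perturbation never exceeds the stated budget; whether the paper applies it at every step is not stated in the report.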