Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Authors: Jiaxin Song, Yixu Wang, Jie Li, Xuan Tong, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy, achieves 94.32% white-box and 67.28% black-box attack success averagely, which are 6.17% and 21.13% higher than SOTA methods, respectively. Our findings expose a overlooked safety risk in VLMs and highlight the urgent need for more robust defenses.
Researcher Affiliation	Collaboration	1 Shanghai Jiao Tong University; 2 Shanghai Artificial Intelligence Laboratory; 3 Fudan University; 4 NSFOCUS
Pseudocode	Yes	Algorithm 1: Safety Boundary Probing; Algorithm 2: Safety Boundary Crossing
Open Source Code	No	While the code and data are not yet released, we commit to making both publicly available upon publication. The supplemental material outlines the structure of the planned release, including attack implementation, model configurations, and evaluation scripts to ensure reproducibility.
Open Datasets	Yes	We leveraged the MM-SafetyBench dataset Liu et al. [2024c], a meticulously curated multimodal safety evaluation benchmark.
Dataset Splits	No	The paper mentions "1,719 adversarial examples across diverse risk scenarios" as the content of the MM-SafetyBench dataset, but it does not specify how these examples were split into training, validation, or test sets for the experiments conducted in this paper.
Hardware Specification	Yes	All experiments are conducted on 8 NVIDIA A100 GPUs.
Software Dependencies	No	The paper does not provide specific version numbers for ancillary software dependencies such as libraries or frameworks used for implementation, beyond general mentions in the NeurIPS checklist about settings being in the appendix.
Experiment Setup	Yes	Implementation Details. We set the safety threshold P_0 to 0.3 for determining the decision boundary in classifier space. The multi-objective loss is weighted with λ_1 = 2.0 and λ_2 = 1.0. We use different learning rates: η_v = 0.001 for visual updates and η_t = 0.0005 for textual updates, with fixed suffix length L_suffix = 20 tokens. Visual perturbations are constrained by maximum L∞ norm of ϵ_v^input = 8/255 to ensure imperceptibility. The optimization process runs for 100-150 iterations to ensure convergence.
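
The reported experiment setup can be collected into a small configuration object, which may help anyone attempting a reproduction. This is only a sketch, not the authors' code: the class name `JailBoundSetup`, its field names, and the helper `clip_perturbation` are hypothetical. It also rests on two assumptions: that the norm bound is an L∞ constraint, and that the iteration count is a range of 100 to 150.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JailBoundSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""
    safety_threshold: float = 0.3    # P_0, decision boundary in classifier space
    lambda_1: float = 2.0            # first weight of the multi-objective loss
    lambda_2: float = 1.0            # second weight of the multi-objective loss
    lr_visual: float = 0.001         # eta_v, learning rate for visual updates
    lr_textual: float = 0.0005       # eta_t, learning rate for textual updates
    suffix_length: int = 20          # L_suffix, fixed suffix length in tokens
    epsilon_visual: float = 8 / 255  # perturbation budget (assumed L-infinity)
    min_iterations: int = 100        # assumed lower end of the iteration range
    max_iterations: int = 150        # assumed upper end of the iteration range


def clip_perturbation(delta, epsilon):
    """Clamp every element of delta into [-epsilon, epsilon].

    This is the standard projection onto an L-infinity ball; whether the
    paper applies exactly this step is an assumption.
    """
    return [max(-epsilon, min(epsilon, d)) for d in delta]


setup = JailBoundSetup()
clipped = clip_perturbation([0.1, -0.02, 0.0], setup.epsilon_visual)
```

Here the first element exceeds the budget and is clamped to 8/255, while the other two already lie inside the bound and pass through unchanged.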