Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks
Authors: Jiayang Liu, Siyuan Liang, Shiqian Zhao, Rong-Cheng Tu, Wenbo Zhou, Aishan Liu, Dacheng Tao, Siew Kei Lam
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct large-scale experiments on several T2V models, covering both open-source models (e.g., Open-Sora) and real commercial closed-source models (e.g., Pika, Luma, Kling). The experimental results show that the proposed method improves 11.4% and 10.0% over the existing state-of-the-art method (SoTA) in terms of attack success rate assessed by GPT-4, attack success rate assessed by human accessors, respectively, verifying the significant advantages of the method in terms of attack effectiveness and content control. |
| Researcher Affiliation | Academia | Jiayang Liu Nanyang Technological University Singapore EMAIL Siyuan Liang Nanyang Technological University Singapore EMAIL Shiqian Zhao Nanyang Technological University Singapore EMAIL Rongcheng Tu Nanyang Technological University Singapore EMAIL Wenbo Zhou University of Science and Technology of China China EMAIL Aishan Liu Beihang University China EMAIL Dacheng Tao Nanyang Technological University Singapore EMAIL Siew-Kei Lam Nanyang Technological University Singapore EMAIL |
| Pseudocode | No | The paper describes the method using textual explanations and mathematical formulations, but does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Justification: The paper provides open access to the data and code, the anonymous link is https://anonymous.4open.science/r/NeruIPS_25_t2v-CE60. |
| Open Datasets | Yes | Dataset. Due to computational costs, we construct a subset of the T2VSafetyBench [12] dataset for our experiments. Specifically, we randomly select 50 prompts from each of 14 categories, resulting in a balanced subset with a total of 700 prompts, covering a diverse range of scenarios. |
| Dataset Splits | No | Dataset. Due to computational costs, we construct a subset of the T2VSafetyBench [12] dataset for our experiments. Specifically, we randomly select 50 prompts from each of 14 categories, resulting in a balanced subset with a total of 700 prompts, covering a diverse range of scenarios. The paper uses these 700 prompts for evaluation but does not specify further splits (e.g., training, validation) for the attack method itself. |
| Hardware Specification | Yes | Justification: As described in the Supplemental Material, all experiments were conducted on a server equipped with an Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz, 512 GB of system memory, and one NVIDIA A100 GPU with 40 GB of memory. |
| Software Dependencies | No | We utilize VideoLLaMA2 [30] as the video caption model L. For input filter, we leverage the zero-shot ability of CLIP to classify the text prompts [31]. For output filter, we use the NSFW (Not Safe For Work) detection model, which is a fine-tuned Vision Transformer, as the end-to-end image classifier [32]. While these tools are mentioned, specific version numbers for them or other key software dependencies are not provided. |
| Experiment Setup | Yes | Implementation details. In our optimization function, we set λ = 3.0, β = 2.0, and γ = 1.0. We set the number of iterations to 20, and the number of variants is 5. |
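
The balanced subset described under Open Datasets (50 randomly selected prompts from each of 14 categories, 700 in total) can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_balanced_subset`, the `(category, text)` pair layout, the fixed seed, and the synthetic stand-in prompts are all assumptions made for illustration, since the paper excerpt does not describe how the sampling was implemented or seeded.

```python
import random
from collections import defaultdict


def sample_balanced_subset(prompts, per_category=50, seed=0):
    """Draw `per_category` prompts uniformly at random from every category.

    `prompts` is an iterable of (category, text) pairs. This layout is an
    assumption for illustration, not the benchmark's actual file format.
    """
    by_category = defaultdict(list)
    for category, text in prompts:
        by_category[category].append(text)

    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    subset = []
    for category in sorted(by_category):
        pool = by_category[category]
        if len(pool) < per_category:
            raise ValueError(f"{category} has only {len(pool)} prompts")
        # rng.sample draws without replacement, so no prompt repeats
        subset.extend((category, text) for text in rng.sample(pool, per_category))
    return subset


# Synthetic stand-in data: 14 hypothetical categories with 120 prompts each.
full = [(f"category_{c}", f"prompt {c}-{i}") for c in range(14) for i in range(120)]
subset = sample_balanced_subset(full)
print(len(subset))  # 14 categories x 50 prompts = 700
```

Recording the seed alongside the selected prompt identifiers is one way such a subset could be made reproducible; the excerpt above does not say whether the paper does this.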