Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Authors: Zihan Su, Xuerui Qiu, Hongbin Xu, Tangyu Jiang, Junhao Zhuang, Chun Yuan, Ming Li, Shengfeng He, Fei Richard Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we utilize the widely-used Panda-70M [21] dataset as the video source due to its extensive scale and diverse video categories. For graphical watermarks, we employ the Logo-2K+ [19] dataset, which offers a wide variety of real-world logos. The quantitative and qualitative comparisons with existing methods demonstrate that the proposed Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness.
Researcher Affiliation | Collaboration | Zihan Su (1), Xuerui Qiu (2, 3), Hongbin Xu (4), Tangyu Jiang (1), Junhao Zhuang (1), Chun Yuan (1), Ming Li (5), Shengfeng He (6). (1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Institute of Automation, Chinese Academy of Sciences; (3) Zhongguancun Academy; (4) Bytedance; (5) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); (6) Singapore Management University
Pseudocode | Yes | Algorithm 1 Confidence-Guided Greedy Assignment for Watermark Position Recovery
Open Source Code | Yes | Code is publicly available at https://github.com/Sugewud/Safe-Sora
Open Datasets | Yes | For the video dataset, we use the Panda-70M [21] dataset for training... For the watermark dataset, we use the Logo-2K dataset [19]... For the evaluation of text-to-video generation, we employ the VidProm [53] dataset as the source of prompts.
Dataset Splits | Yes | Specifically, we randomly download 10,000 videos from Panda-70M, sample 8 frames from each video, and resize each frame to a resolution of 320 × 512 for training purposes. For the evaluation of text-to-video generation, we employ the VidProm [53] dataset as the source of prompts. The prompts in VidProm are generated by GPT-4 [54], and we randomly select 100 prompts from the dataset for evaluation.
Hardware Specification | Yes | The model is trained for 30 epochs on 4 NVIDIA RTX 4090 GPUs.
Software Dependencies | No | The paper mentions using VideoCrafter2 [2] as a backbone model and Open-Sora [7] for evaluation but does not specify versions of programming languages, libraries, or other software dependencies.
Experiment Setup | Yes | The patch size is set to 16 × 16. Patch Embedding maps each patch to a 1024-dimensional feature space. The model is trained for 30 epochs on 4 NVIDIA RTX 4090 GPUs. We adopt the AdamW optimizer [55], with the initial learning rate set to 5e-4, which is gradually decayed to 1e-6 following a cosine decay schedule. The watermark embedding network uses M = 2 3D SFMamba Blocks, while the watermark extraction network uses N = 4 3D SFMamba Blocks. The hyperparameter λ in Eq. 7 is set to 0.75.
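
The Experiment Setup entry reports a learning rate that starts at 5e-4 and decays to 1e-6 on a cosine schedule over 30 epochs. The sketch below shows one common way such a schedule is computed. It assumes the standard half-cosine form, per-epoch updates, and no warmup, none of which the quoted text confirms; the function name `cosine_lr` and its parameters are hypothetical and are not taken from the Safe-Sora code.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_max: float = 5e-4, lr_min: float = 1e-6) -> float:
    """Half-cosine decay from lr_max (step 0) to lr_min (step == total_steps).

    Assumption: this is the textbook cosine-annealing formula, not
    necessarily the exact schedule used by the paper's authors.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Assumed granularity: one update per epoch, over the 30 reported epochs.
schedule = [cosine_lr(epoch, 30) for epoch in range(31)]
```

Under these assumptions the rate falls slowly at first, fastest around the midpoint (epoch 15, roughly 2.5e-4), and flattens out again as it approaches 1e-6. If the schedule is stepped per iteration rather than per epoch, the same function applies with `total_steps` set to the total number of optimizer steps.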