Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking
Authors: Zihan Su, Xuerui Qiu, Hongbin Xu, Tangyu Jiang, Junhao Zhuang, Chun Yuan, Ming Li, Shengfeng He, Fei Richard Yu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we utilize the widely-used Panda-70M [21] dataset as the video source due to its extensive scale and diverse video categories. For graphical watermarks, we employ the Logo-2K+ [19] dataset, which offers a wide variety of real-world logos. The quantitative and qualitative comparisons with existing methods demonstrate that the proposed Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness. |
| Researcher Affiliation | Collaboration | Zihan Su¹, Xuerui Qiu²,³, Hongbin Xu⁴, Tangyu Jiang¹, Junhao Zhuang¹, Chun Yuan¹, Ming Li⁵, Shengfeng He⁶. ¹ Tsinghua Shenzhen International Graduate School, Tsinghua University; ² Institute of Automation, Chinese Academy of Sciences; ³ Zhongguancun Academy; ⁴ Bytedance; ⁵ Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); ⁶ Singapore Management University |
| Pseudocode | Yes | Algorithm 1 Confidence-Guided Greedy Assignment for Watermark Position Recovery |
| Open Source Code | Yes | Code is publicly available at https://github.com/Sugewud/Safe-Sora |
| Open Datasets | Yes | For the video dataset, we use the Panda-70M [21] dataset for training... For the watermark dataset, we use the Logo-2K dataset [19]... For the evaluation of text-to-video generation, we employ the VidProm [53] dataset as the source of prompts. |
| Dataset Splits | Yes | Specifically, we randomly download 10,000 videos from Panda-70M, sample 8 frames from each video, and resize each frame to a resolution of 320 × 512 for training purposes. For the evaluation of text-to-video generation, we employ the VidProm [53] dataset as the source of prompts. The prompts in VidProm are generated by GPT-4 [54], and we randomly select 100 prompts from the dataset for evaluation. |
| Hardware Specification | Yes | The model is trained for 30 epochs on 4 NVIDIA RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions using VideoCrafter2 [2] as a backbone model and Open-Sora [7] for evaluation but does not specify versions of programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | The patch size is set to 16 × 16. Patch Embedding maps each patch to a 1024-dimensional feature space. The model is trained for 30 epochs on 4 NVIDIA RTX 4090 GPUs. We adopt the AdamW optimizer [55], with the initial learning rate set to 5e-4, which is gradually decayed to 1e-6 following a cosine decay schedule. The watermark embedding network uses M = 2 3D SFMamba Blocks, while the watermark extraction network uses N = 4 3D SFMamba Blocks. The hyperparameter λ in Eq. 7 is set to 0.75. |
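
The learning-rate schedule quoted in the Experiment Setup row (5e-4 decayed to 1e-6 on a cosine schedule over 30 epochs) can be illustrated with a minimal sketch. It uses the standard cosine annealing formula and is not taken from the authors' code. The function name `cosine_lr` is hypothetical, and stepping once per epoch is an assumption, since the quoted text does not say whether the decay is applied per epoch or per iteration.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_max: float = 5e-4, lr_min: float = 1e-6) -> float:
    """Standard cosine annealing from lr_max (step 0) to lr_min (final step).

    Hypothetical helper for illustration; the default rates are the values
    quoted in the Experiment Setup row.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so out-of-range inputs stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Assumption: one scheduler step per epoch, 30 epochs as reported.
EPOCHS = 30
schedule = [cosine_lr(epoch, EPOCHS) for epoch in range(EPOCHS + 1)]

if __name__ == "__main__":
    for epoch in (0, 10, 20, 30):
        print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.3e}")
```

If the schedule were instead stepped per iteration, the same function would apply with `total_steps` set to the total number of optimizer updates.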