Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Mask Image Watermarking

Authors: Runyi Hu, Jie M. Zhang, Shiqian Zhao, Nils Lukas, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments show that MaskWM achieves state-of-the-art performance in global and local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images.
Researcher Affiliation	Academia	(1) Nanyang Technological University; (2) CFAR and IHPC, A*STAR, Singapore; (3) MBZUAI; (4) Zhejiang University; (5) Tsinghua University
Pseudocode	Yes	Algorithm 1: Resolution scaling watermark embedding on arbitrary resolution images
Open Source Code	Yes	https://github.com/hurunyi/MaskWM
Open Datasets	Yes	For all experiments, we train MaskWM on 83k images from the MS-COCO 2014 training set [13] and the training details are provided in Appendix C.1.1.
Dataset Splits	Yes	For global watermarking, we sample 1,000 images from the MS-COCO 2014 validation set. We divide the dataset into 12 subsets based on the ratio of masked area to the full image: 1-5%, 5-10%, 10-20%, ..., 80-90%, 90-95%, and 95-99%. From each subset, 400 images are randomly selected.
Hardware Specification	Yes	It requires only 20 hours of training on a single A6000 GPU, achieving 15× computational efficiency compared to WAM.
Software Dependencies	No	The paper mentions deep learning architectures like U-Net and U2-Net, but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch versions).
Experiment Setup	Yes	Training is conducted for 100k steps with a batch size of 16 on a single NVIDIA A6000 GPU. We use the AdamW optimizer with a learning rate of 1 × 10^-4, and apply a cosine learning rate scheduler with 2k warm-up steps. We adopt an easy-to-hard training strategy inspired by TrustMark [2]. During the first 0.5k steps, the mask is set to full (i.e., all ones) and no distortion is applied. From step 0.5k to 1k, we introduce all types of masks. After 1k steps, distortions are added. The encoder loss weight β_enc is fixed at 1, while the decoder loss weight β_dec is initially set to 20 and linearly decayed to 0.2 over the first 5k steps. The mask loss weight α is set to 0.5. The JND module in the encoder is introduced and tuned starting from step 5k, with the scaling factor µ set to 1.
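
The schedule quoted in the Experiment Setup row can be sketched as a few step-indexed functions. This is an illustrative reading of the quoted text, not the authors' code: the function names are hypothetical, the warm-up is assumed to be linear, the cosine schedule is assumed to decay to zero at step 100k, and the inclusivity of the phase boundaries (0.5k, 1k, 5k) is a guess.

```python
import math

# Values taken from the quoted setup.
TOTAL_STEPS = 100_000
WARMUP_STEPS = 2_000
BASE_LR = 1e-4
BETA_ENC = 1.0   # encoder loss weight, fixed
ALPHA_MASK = 0.5  # mask loss weight, fixed


def learning_rate(step):
    """Warm-up then cosine decay. Linear warm-up and decay to 0 are assumptions."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))


def decoder_loss_weight(step):
    """beta_dec: linear decay from 20 to 0.2 over the first 5k steps, then constant."""
    frac = min(step / 5_000, 1.0)
    return 20.0 + (0.2 - 20.0) * frac


def curriculum(step):
    """Easy-to-hard phases; boundary inclusivity is an assumption."""
    return {
        "full_mask_only": step < 500,      # all-ones mask, no distortion
        "all_mask_types": step >= 500,     # every mask type from 0.5k onward
        "distortions": step >= 1_000,      # distortions added after 1k
        "jnd_enabled": step >= 5_000,      # JND module tuned from 5k
    }


if __name__ == "__main__":
    for s in (0, 500, 1_000, 2_000, 5_000, 100_000):
        print(s, learning_rate(s), decoder_loss_weight(s), curriculum(s))
```

Writing the phases as pure functions of the step makes it easy to check a reimplementation against the reported configuration before starting a 100k-step run.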