Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

On the Coexistence and Ensembling of Watermarks

Authors: Aleksandar Petrov, Shruti Agarwal, Philip H.S. Torr, Adel Bibi, John Collomosse

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform the first study of coexistence of deep image watermarking methods and, contrary to intuition, we find that various open-source watermarks can coexist with only minor impacts on image quality and decoding robustness. The coexistence of watermarks also opens the avenue for ensembling watermarking methods. We show how ensembling can increase the overall message capacity and enable new trade-offs between capacity, accuracy, robustness and image quality, without needing to retrain the base models. Our experiments focus on image watermarks as the most mature watermarking domain but the same principles likely apply to video, audio and possibly text watermarking.
Researcher Affiliation	Collaboration	Aleksandar Petrov (Adobe Research, University of Oxford); Shruti Agarwal (Adobe Research); Philip Torr (University of Oxford); Adel Bibi (University of Oxford); John Collomosse (Adobe Research, University of Surrey)
Pseudocode	Yes	Pseudo-code for ensembling and strength clipping: def series_ensemble(original: Image, wm1: WatermarkingMethod, wm2: WatermarkingMethod, m1: List[bool], m2: List[bool]) -> Image: watermarked1 = wm1(original, m1); watermarked2 = wm2(watermarked1, m2); return watermarked2 (wm1, wm2: callable watermarking methods; m1, m2: the binary secrets for the corresponding watermarking methods) def parallel_ensemble(original: Image, wm1: WatermarkingMethod, wm2: WatermarkingMethod, m1: List[bool], m2: List[bool]) -> Image:
Open Source Code	No	To ensure reproducibility of our experiments, we restricted ourselves to watermarking methods with open-source implementations. We used the invisible-watermark implementations of DwtDct, DwtDctSvd and RivaGAN and the official implementations of SSL, RoSteALS and TrustMark. For HiDDeN, we used the Stable Signature reimplementation. Furthermore, we provide pseudo-code for our ensembling strategies (in App. B), detailed description of the error-correcting codes we used (in App. C), details about the augmentations for the robustness measures we benchmarked against (in App. D) and comprehensive tables with all the experiments discussed in the paper (in App. F). Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Unfortunately, at this time, the data and code are proprietary, but we provide sufficient instructions to reproduce all experimental results. We are also currently exploring whether we can release an open-source version of the evaluation setup.
Open Datasets	No	The results are averaged over 1020 samples from Adobe Stock. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Unfortunately, at this time, the data and code are proprietary, but we provide sufficient instructions to reproduce all experimental results. We are also currently exploring whether we can release an open-source version of the evaluation setup.
Dataset Splits	No	The results are averaged over 1020 samples from Adobe Stock. Watermarking requires a trade-off between four competing objectives: i. Capacity: the length of the secret message (in bits); ii. Image quality: the amount of distortion added to the cover image to embed the secret, often measured in peak signal-to-noise ratio (PSNR), larger values indicating better quality; iii. Accuracy: the fraction of correctly decoded secrets, usually over a test dataset of images; iv. Robustness: the fraction of secrets we can decode correctly when certain transformations or edits have been added to the image after watermarking. While the paper mentions using a 'test dataset of images' and 1020 samples from 'Adobe Stock', it does not specify any training/validation/test splits or how the 1020 samples are partitioned, if at all beyond being a test set.
Hardware Specification	No	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: We only perform inference on relatively small models. There has been no training of models in the scope of this work. Therefore, the compute costs are negligible. The paper states compute costs are negligible for inference but provides no specific details about the hardware used for this inference.
Software Dependencies	No	To ensure reproducibility of our experiments, we restricted ourselves to watermarking methods with open-source implementations. We used the invisible-watermark implementations of DwtDct, DwtDctSvd and RivaGAN and the official implementations of SSL, RoSteALS and TrustMark. For HiDDeN, we used the Stable Signature reimplementation. The paper lists software packages and implementations used but does not provide specific version numbers for any of them (e.g., Python version, library versions like PyTorch, or specific versions of invisible-watermark, SSL, RoSteALS, TrustMark, or Stable Signature).
Experiment Setup	Yes	Experimental setup for testing post-training watermark modifiers. We consider two ensembling approaches: series, where the second watermark is applied to an image already watermarked by the first method (order matters), and parallel, where both watermarks are independently applied to the original image, their residuals are averaged, with the result applied to the original image (order doesn’t matter). Pseudo code is provided in Lst. 1. To adjust the watermark strength, we apply PSNR clipping where a strength of 0 means the target PSNR is the lower of the individual methods’ PSNRs, and a strength of 1 means it is the higher one (pseudo code in Lst. 2). Clipping can also use strengths outside this range. While one could apply the same target PSNR for all images, we opted for an image-wise approach since PSNR values vary greatly between images. We use linear codes (Purser, 1995; Guruswami et al., 2023) with block lengths matching the sum of the capacities of the two base methods. We denote a binary linear code as LC[n, k, d], with n being the block size, k the message size, and d the minimum Hamming distance between codewords. Such a code reduces the effective capacity from n bits to k bits and can correct up to ⌊(d − 1)/2⌋ bit flips using the resulting n − k redundancy bits. The codes we use are from the code tables of Grassl (2006). Ensembling with another method can boost the watermarking performance. We observe cases where a method is improved along all dimensions. For example, in Fig. 5C, we can see that when RoSteALS is ensembled with SSL (42 dB, 30 bits) (series application, 0.2 clipping, LC[130, 100, 10], Lst. 7), we can maintain 100 bit capacity while boosting accuracy by almost 15%, robustness with respect to RivaGAN augmentations by more than 8% and quality by 2 dB while maintaining the robustness to SSL augmentations. Similarly, in Fig. 5D we see the capacity of HiDDeN boosted from 48 bits to 102 bits when ensembled with RoSteALS (series, 1.0 clip, LC[148, 102, 14], Lst. 10), with also small improvements in accuracy, robustness, and quality.
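
The ensembling and clipping steps quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' implementation: the flat-list image representation, the `Watermarker` interface, the `toy_watermarker` helper, the linear interpolation in `target_psnr`, and the choice to clip by uniformly shrinking the residual are all assumptions made for illustration. Only the series/parallel structure, the strength endpoints (0 gives the lower PSNR, 1 the higher), and the ⌊(d − 1)/2⌋ correction bound come from the quoted text.

```python
import math
from typing import Callable, List, Sequence

# Assumption for this sketch: an image is a flat list of pixel intensities in [0, 255].
Image = List[float]
# Hypothetical interface: a watermarker maps (image, binary secret) to a watermarked image.
Watermarker = Callable[[Image, Sequence[bool]], Image]


def psnr(original: Image, modified: Image, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    mse = sum((a - b) ** 2 for a, b in zip(original, modified)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def series_ensemble(original: Image, wm1: Watermarker, wm2: Watermarker,
                    m1: Sequence[bool], m2: Sequence[bool]) -> Image:
    """Apply wm2 on top of the image already watermarked by wm1 (order matters)."""
    return wm2(wm1(original, m1), m2)


def parallel_ensemble(original: Image, wm1: Watermarker, wm2: Watermarker,
                      m1: Sequence[bool], m2: Sequence[bool]) -> Image:
    """Watermark the original independently, average the residuals, add them back."""
    w1, w2 = wm1(original, m1), wm2(original, m2)
    return [o + ((a - o) + (b - o)) / 2 for o, a, b in zip(original, w1, w2)]


def target_psnr(psnr1: float, psnr2: float, strength: float) -> float:
    """Strength 0 gives the lower PSNR, 1 the higher (linear in between: an assumption)."""
    lo, hi = min(psnr1, psnr2), max(psnr1, psnr2)
    return lo + strength * (hi - lo)


def clip_to_psnr(original: Image, watermarked: Image, target: float) -> Image:
    """Shrink the residual uniformly so the PSNR is at least `target` (assumed scheme)."""
    current = psnr(original, watermarked)
    if current >= target:
        return list(watermarked)
    # Scaling the residual by s changes the PSNR by -20*log10(s) dB.
    scale = 10 ** ((current - target) / 20)
    return [o + scale * (w - o) for o, w in zip(original, watermarked)]


def correctable_errors(d: int) -> int:
    """Bit flips a code with minimum Hamming distance d can correct: floor((d - 1) / 2)."""
    return (d - 1) // 2


def toy_watermarker(delta: float) -> Watermarker:
    """Hypothetical stand-in for a real method: shifts each pixel by +/- delta."""
    def embed(image: Image, secret: Sequence[bool]) -> Image:
        return [p + (delta if secret[i % len(secret)] else -delta)
                for i, p in enumerate(image)]
    return embed


if __name__ == "__main__":
    image = [100.0] * 16
    wm_a, wm_b = toy_watermarker(2.0), toy_watermarker(4.0)
    secret = [True, False]
    stacked = series_ensemble(image, wm_a, wm_b, secret, secret)
    goal = target_psnr(psnr(image, wm_a(image, secret)),
                       psnr(image, wm_b(image, secret)), strength=0.2)
    print(psnr(image, clip_to_psnr(image, stacked, goal)), goal)
    print(correctable_errors(10), correctable_errors(14))
```

Under these assumptions, the block lengths in the quoted examples follow from the base capacities: 100 + 30 = 130 bits for RoSteALS with SSL, and 48 + 100 = 148 bits for HiDDeN with RoSteALS. A minimum distance of 10 allows 4 corrected bit flips and a minimum distance of 14 allows 6. The sketch does not clamp pixel values to the valid range, which a real pipeline would likely need to do.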