Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Watermarking Autoregressive Image Generation

Authors: Nikola Jovanović, Ismail Labiad, Tomáš Souček, Martin Vechev, Pierre Fernandez

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In Sec. 4.1, we measure the effect of RCC finetuning (Sec. 3.1) and the synchronization layer (Sec. 3.2) on RCC, quality, and the power of our watermark. Sec. 4.2 studies robustness under common transformations and attacks, while Sec. 4.3 studies joint watermarking of text and images. Additional experimental details and results are given in App. E and App. F, respectively.
Researcher Affiliation	Collaboration	Nikola Jovanović1,2 Ismail Labiad1,3 Tomáš Souček1 Martin Vechev2 Pierre Fernandez1 1FAIR, Meta 2ETH Zurich 3Université Paris-Saclay
Pseudocode	No	The paper describes algorithms and procedures in paragraph text, for example, the finetuning procedure in Section 3.1 and the full algorithm description for image synchronization in Appendix D.1, but does not present them in a structured pseudocode or algorithm block format.
Open Source Code	Yes	Code and models are available at https://github.com/facebookresearch/wmar.
Open Datasets	Yes	For TAMING and RAR-XL we set δ = 2, γ = 0.25, h = 1 and evaluate (for each transformation/attack) on 1000 generations, 100 per each of the following ImageNet class indices: [1, 9, 232, 340, 568, 656, 703, 814, 937, 975]. For CHAMELEON we set δ = 2, γ = 0.25, h = 0. We again use 1000 generations, conditioning the model on a text prompt each time. Following the standard protocol in the literature [79, 80, 83, 85] we use the prompts from the validation set of MS-COCO [60].
Dataset Splits	Yes	In each experiment, we generate 1000 samples per model (100 samples per each of 10 ImageNet classes for TAMING and RAR-XL, and 1000 COCO prompts for CHAMELEON). We finetune models on tokens derived from 50,000 ImageNet training samples for 10 epochs.
Hardware Specification	Yes	We finetune models on tokens derived from 50,000 ImageNet training samples for 10 epochs (2h on 16 V100 for TAMING, 2.5h on 8 H200 for CHAMELEON, and 0.5h on 8 H200 for RAR-XL).
Software Dependencies	No	The paper mentions using specific models and libraries like "Adam optimizer [47]", "CompressAI [7] library", and specific VAEs and diffusers libraries, but it does not provide specific version numbers for these software components or programming languages like Python or CUDA.
Experiment Setup	Yes	We use δ = 2 and γ = 0.25 in all experiments, h = 1 for TAMING, RAR-XL, and CHAMELEON on text, and h = 0 for CHAMELEON on images. We finetune models on tokens derived from 50,000 ImageNet training samples for 10 epochs... We use the Adam optimizer [47] with a learning rate of 10⁻⁴, multiplied by a factor of 0.9 each epoch (StepLR). We use a total batch size across all GPUs of 64... and always set λ = 1. As noted above, we use a set of augmentations A to improve robustness of our watermark to transformations and attacks.
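
The training and evaluation numbers quoted in the table can be sanity-checked with a few lines of arithmetic. The sketch below is not taken from the paper's code. The function and constant names are hypothetical, and two points are assumptions: that the 0.9 decay is applied after each completed epoch (so the first epoch runs at the base rate), and that a final partial batch is kept rather than dropped.

```python
# Values copied from the Experiment Setup and Open Datasets rows above.
BASE_LR = 1e-4            # initial Adam learning rate
DECAY = 0.9               # multiplicative factor applied each epoch
EPOCHS = 10
TOTAL_BATCH_SIZE = 64     # summed across all GPUs
TRAIN_SAMPLES = 50_000    # ImageNet training samples used for finetuning
IMAGENET_CLASSES = [1, 9, 232, 340, 568, 656, 703, 814, 937, 975]
SAMPLES_PER_CLASS = 100


def lr_at_epoch(epoch: int) -> float:
    """Learning rate during `epoch` (0-indexed).

    Assumption: the decay is applied after each completed epoch.
    """
    return BASE_LR * DECAY ** epoch


def steps_per_epoch(samples: int = TRAIN_SAMPLES,
                    batch: int = TOTAL_BATCH_SIZE) -> int:
    """Optimizer steps per epoch, via ceiling division.

    Assumption: the last partial batch is kept, not dropped.
    """
    return -(-samples // batch)


# Per-epoch learning rates and the total number of evaluation generations.
schedule = [lr_at_epoch(e) for e in range(EPOCHS)]
total_generations = len(IMAGENET_CLASSES) * SAMPLES_PER_CLASS

if __name__ == "__main__":
    for epoch, lr in enumerate(schedule):
        print(f"epoch {epoch}: lr = {lr:.3e}")
    print("steps per epoch:", steps_per_epoch())
    print("evaluation generations:", total_generations)
```

Under these assumptions the learning rate in the tenth epoch is 10⁻⁴ × 0.9⁹, roughly 3.87 × 10⁻⁵, and the ten class indices at 100 samples each account for the 1000 generations reported for TAMING and RAR-XL.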