Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

BitMark: Watermarking Bitwise Autoregressive Image Generative Models

Authors: Louis Kerner, Michel Meintz, Bihe Zhao, Franziska Boenisch, Adam Dziedzic

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Empirical Evaluation. Experimental Setup. In the following, we employ the Infinity-2B checkpoint [11] as the default model to evaluate BitMark across different watermarking strengths and against a wide variety of attacks. Additionally we show the generalizability of BitMark to other models, namely the Infinity-8B checkpoint as well as the Instella-T2I [37] IAR model. We further show generalizability of our method to bitwise diffusion models in Section G. The Infinity models natively generate 1024 × 1024 pixel RGB-images, we choose the visual-tokenizer with a vocabulary size of 2^32 and 2^56 per token for the Infinity-2B and the Infinity-8B checkpoint, respectively. We follow Han et al. [11] in the choice of hyper-parameters and use a classifier-free-guidance of 4, with a generation over 13 scales, following the scale schedule for an aspect ratio of 1:1. Hence, we utilize the best performing model, tokenizer and hyperparameters reported by Han et al. [11]. We ablate the choice of the green and red list split for our experiments in Section D. For Instella IAR we follow Wang et al. [37], leveraging a vocabulary of size 2^32 and a cfg of 7.5 at a 1024 × 1024 resolution. We evaluate our experiments based on images and prompts of the MS-COCO 2014 [18] validation dataset. Unless stated otherwise, we use a subset of 10,000 images.
Researcher Affiliation	Academia	Louis Kerner, Michel Meintz, Bihe Zhao, Franziska Boenisch, Adam Dziedzic EMAIL CISPA Helmholtz Center for Information Security
Pseudocode	Yes	Algorithm 1: Watermark Embedding. Inputs: autoregressive model f_θ with parameters θ, green list G, red list R, image decoder D. Hyperparameters: steps K (number of resolutions), resolutions (h_i, w_i) for i = 1, ..., K, the number of tokens for resolution i is r_i, constant δ added to the logits of the green list bits, n the length of the bit sequences in the lists (G and R), m the number of bits per token. for i = 1, ..., K do: (l_1, ..., l_{r_i·m}) = f_θ(e, h_i, w_i); for t = 1, ..., r_i do: for j = n, ..., m do: c = (t − 1)·m + j // counter; p_c = Bias(G, l_c, δ); s_c = Softmax(p_c); b_c = Sample(s_c); u_i = (b_1, ..., b_{r_i·m}); z_i = Lookup(u_i); z_i = Interpolate(z_i, h_K, w_K); e = e + ϕ_i(z_i); im = D(e). Return: watermarked image im.
Open Source Code	Yes	The code is available at https://github.com/sprintml/BitMark.
Open Datasets	Yes	We evaluate our experiments based on images and prompts of the MS-COCO 2014 [18] validation dataset.
Dataset Splits	Yes	We evaluate our experiments based on images and prompts of the MS-COCO 2014 [18] validation dataset. Unless stated otherwise, we use a subset of 10,000 images.
Hardware Specification	Yes	Our experiments are performed on Ubuntu 22.04, with Intel(R) Xeon(R) Gold 6330 CPU and NVIDIA A40 Graphics Card with 40 GB of memory.
Software Dependencies	No	Our experiments are performed on Ubuntu 22.04, with Intel(R) Xeon(R) Gold 6330 CPU and NVIDIA A40 Graphics Card with 40 GB of memory. The paper mentions the operating system version but does not provide specific version numbers for software libraries or frameworks like Python, PyTorch, or CUDA, which are essential for reproducing machine learning experiments.
Experiment Setup	Yes	We follow Han et al. [11] in the choice of hyper-parameters and use a classifier-free-guidance of 4, with a generation over 13 scales, following the scale schedule for an aspect ratio of 1:1. Hence, we utilize the best performing model, tokenizer and hyperparameters reported by Han et al. [11]. We ablate the choice of the green and red list split for our experiments in Section D. For Instella IAR we follow Wang et al. [37], leveraging a vocabulary of size 2^32 and a cfg of 7.5 at a 1024 × 1024 resolution. ... Therefore, we first generate 1,000 images with prompts of the MS-COCO 2014 [18] validation dataset with our BitMark embedded under a watermarking strength of δ = 2. The watermarked outputs of M1 are then used to train (i.e., in our approximative setup full fine-tune) the M2 model. ... for 5 epochs with a learning rate of 10^-4 and a batch-size of 4 (batch-size of 1 for Infinity-2B). ... We set the Gumbel temperature to 0.01, the cfg to 2.5, the number of iterations as well as the diffusion timesteps to 10 each.
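
Illustrative sketch: The inner loop of the quoted Algorithm 1 (Bias, Softmax, Sample) can be pictured with the short Python sketch below. It is not the authors' implementation. The function names, the two-logits-per-bit layout, the use of the preceding n − 1 sampled bits as context, the unbiased handling of positions without enough context, and the example green list {(0, 1), (1, 0)} are all assumptions made for illustration. Only the idea of adding a constant δ to the logits of green-list bits, and the value δ = 2, come from the quoted text.

```python
import math
import random


def bias_logits(green, prefix, logits, delta):
    """Add delta to the logit of every bit value that completes a green pattern.

    `logits` holds two entries, for bit 0 and bit 1 (assumed layout).
    """
    return [
        logit + delta if prefix + (bit,) in green else logit
        for bit, logit in enumerate(logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_bits(bit_logits, green, n, delta, rng):
    """Sample one bit per position, nudging towards green n-bit patterns."""
    bits = []
    for logits in bit_logits:
        # Assumption: positions with fewer than n - 1 preceding bits stay unbiased.
        if len(bits) >= n - 1:
            prefix = tuple(bits[len(bits) - (n - 1):])
            logits = bias_logits(green, prefix, logits, delta)
        probs = softmax(logits)
        bits.append(0 if rng.random() < probs[0] else 1)
    return bits


def green_fraction(bits, green, n):
    """Share of sliding n-bit windows that fall in the green list."""
    windows = [tuple(bits[i:i + n]) for i in range(len(bits) - n + 1)]
    return sum(w in green for w in windows) / len(windows)


# Hypothetical setup: uninformative logits and a made-up green list for n = 2.
rng = random.Random(0)
green = {(0, 1), (1, 0)}
flat_logits = [[0.0, 0.0]] * 2000

plain = sample_bits(flat_logits, green, n=2, delta=0.0, rng=rng)
marked = sample_bits(flat_logits, green, n=2, delta=2.0, rng=rng)

print("green fraction without bias:", green_fraction(plain, green, 2))
print("green fraction with delta=2:", green_fraction(marked, green, 2))
```

With uninformative logits, the unbiased run should land near one half green windows, while the biased run should land well above it. The `green_fraction` helper is only a sanity check for this sketch and is not the detection procedure described in the paper.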