Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

HeavyWater and SimplexWater: Distortion-free LLM Watermarks for Low-Entropy Distributions

Authors: Dor Tsur, Carol Long, Claudio Mayrink Verdun, Sajani Vithana, Hsiang Hsu, Richard Chen, Haim Permuter, Flavio Calmon

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of HeavyWater and SimplexWater across several benchmarks, demonstrating that they achieve superior detection with minimal degradation in downstream quality across generation tasks. We evaluate the detection-distortion-robustness tradeoff of HeavyWater and SimplexWater following the experimental benchmarking guidelines in Waterbench [43] and Mark My Words [44]. We use two popular prompt-generation datasets: Finance-QA [45] (covering Q&A tasks) and LCC [46] (covering coding tasks). We evaluate on three open-weight models: Llama2-7B [47], Llama3-8B [48], and Mistral-7B [49]. Implementation details are provided in Appendix C.4, and full results, including ablation studies, are presented in Appendix D.
Researcher Affiliation | Collaboration | Dor Tsur (Ben Gurion University); Carol Xuan Long (Harvard University); Claudio Mayrink Verdun (Harvard University); Hsiang Hsu (JPMorgan Chase); Richard Chen (JPMorgan Chase); Haim Permuter (Ben Gurion University); Sajani Vithana (Harvard University); Flavio P. Calmon (Harvard University)
Pseudocode | Yes | Algorithm 1: SimplexWater and HeavyWater Watermark Generation
Open Source Code | Yes | Furthermore, we provide the code which can be used to recreate the results we report. We provide code implementation of the proposed algorithms and methods. Those are provided in the Supplement material in an anonymized version and will be made public upon acceptance.
Open Datasets | Yes | We use two popular prompt-generation datasets: Finance-QA [45] (covering Q&A tasks) and LCC [46] (covering coding tasks).
Dataset Splits | No | The paper mentions using the Finance-QA [45] and LCC [46] datasets and specifies evaluation parameters such as "At a generation length of T = 300" and how subsets were sampled for bootstrapping error bars ("Out of the 200 responses, we randomly sample 200 subsets with 150 responses"). However, it does not explicitly provide information about train/test/validation splits for the main experimental evaluation of the watermarking methods.
Hardware Specification | Yes | The experiments section in the main paper provides information on the type of GPU used to run our experiments (Namely, a single A-100 GPU).
Software Dependencies | No | In our implementation, we solve Sinkhorn's algorithm using the Python optimal transport package [65]. However, specific version numbers for this package or other core software dependencies (e.g., Python, PyTorch, CUDA) are not provided in the paper.
Experiment Setup | Yes | Our watermarks receive as input the next-token distribution P_X, side information sample s ∈ S, and a score function f. The amount of tilting (and thereby the level of distortion) is determined by a parameter δ ∈ (0, 1). The structure of tilting depends on the structure of f and is later defined for each watermark separately. We solve (2) using Sinkhorn's algorithm [19]. At a generation length of T = 300, we observe that the Gaussian approximation provided by the z-test is very accurate, ensuring that p-values remain reliable even when the exact distribution is unknown. We note that, in many text generation schemes, lower top-p values are used, which accelerate Sinkhorn's algorithm's runtime, thus further closing the computational gap.
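
Two procedures quoted in the table, the z-test with its Gaussian approximation and the bootstrapped error bars over subsets of responses, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the per-token null mean and standard deviation (`null_mean`, `null_std`), and the choice to draw subsets without replacement are all assumptions made for illustration; the paper's actual test statistic and sampling details are not given in this summary.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def z_test_p_value(scores, null_mean, null_std):
    """One-sided p-value for a sum of per-token scores, Gaussian approximation.

    null_mean and null_std are the ASSUMED per-token mean and standard
    deviation of the score on unwatermarked text (hypothetical inputs).
    """
    n = len(scores)
    z = (sum(scores) - n * null_mean) / (null_std * math.sqrt(n))
    # Large z (scores above the null expectation) gives a small p-value.
    return 1.0 - NormalDist().cdf(z)


def bootstrap_error_bar(values, n_subsets=200, subset_size=150, seed=0):
    """Mean and standard deviation of subset means.

    Defaults mirror the quoted setup (200 subsets of 150 responses);
    drawing without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    subset_means = [
        mean(rng.sample(values, subset_size)) for _ in range(n_subsets)
    ]
    return mean(subset_means), stdev(subset_means)
```

For example, `z_test_p_value(token_scores, 0.5, math.sqrt(1 / 12))` would test T = 300 scores against a uniform-on-[0, 1] null, and `bootstrap_error_bar(per_response_metric)` would return a centre and error bar for a list of 200 per-response values; both variable names are placeholders.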