Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

Authors: Weiqing He, Xiang Li, Tianqi Shang, Li Shen, Weijie Su, Qi Long

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.
Researcher Affiliation	Academia	Weiqing He, University of Pennsylvania; Xiang Li, University of Pennsylvania; Tianqi Shang, University of Pennsylvania; Li Shen, University of Pennsylvania; Weijie Su, University of Pennsylvania; Qi Long, University of Pennsylvania
Pseudocode	Yes	Algorithm 1: GoF test for watermark detection (example: Kol for the Gumbel-max watermark). Require: token sequence $w_{1:n}$; watermark decoder $S$; significance level $\alpha$; CDF under the null $F_0$. 1: Compute pivotal statistics $Y_1, \dots, Y_n$ from the sequence $w_{1:n}$. 2: Compute p-values: $p_t = 1 - F_0(Y_t)$, $t = 1, \dots, n$. 3: Sort the p-values in ascending order: $p_{(1)} \le \dots \le p_{(n)}$. 4: Compute the test statistic (Kol): $D_n = \max_{1 \le i \le n} \max\{ p_{(i)} - (i-1)/n,\ i/n - p_{(i)} \}$. 5: Estimate the critical value $\gamma_\alpha$ based on the information of the watermarking scheme. 6: if $D_n > \gamma_\alpha$ then 7: reject $H_0$ 8: else 9: do not reject $H_0$ 10: end if
Open Source Code	Yes	Code is available at https://github.com/hwq0726/GoF-for-Watermark-Detection.
Open Datasets	Yes	For text completion, we use the C4 dataset [46]. Each document in the dataset is truncated to the first 50 tokens, which serve as prompts for the LLM to complete. For long-form question answering, we use the ELI5 dataset [12], where the LLM generates detailed answers to given questions.
Dataset Splits	No	In both tasks, we randomly sample 1,000 documents and conduct experiments using three LLMs across four temperature settings, following a consistent pipeline. As the relative performance of different GoF tests is similar between tasks, we present the results on text completion in this section and defer the detailed results for the ELI5 dataset to Appendix D.4. To this end, we randomly sample 1,000 human-written texts from the C4 dataset and assess Type I error at a significance level of α = 0.01, a standard choice in prior work [9, 33, 29, 31].
Hardware Specification	Yes	All text generation tasks were conducted on NVIDIA A100 GPUs, with a total computational cost of approximately 360 GPU hours to reproduce all experiments.
Software Dependencies	No	It's worth noting that the limiting distributions offer valuable theoretical tools, but in practice, simulations offer users a more flexible and convenient way to compute critical values, as in the Python package SciPy, due to the complex forms of some distributions (e.g., the Anderson-Darling distribution [36, 2]).
Experiment Setup	Yes	In our evaluation, we consider three open-source LLMs (OPT-1.3B, OPT-13B [59], and Llama 3.1-8B [11]) across four temperature settings: T ∈ {0.1, 0.3, 0.7, 1.0}. We evaluate watermark performance on two text generation tasks: (i) text completion and (ii) long-form question answering. For text completion, we use the C4 dataset [46]... For long-form question answering, we use the ELI5 dataset [12]... In both tasks, we randomly sample 1,000 documents and conduct experiments using three LLMs across four temperature settings... To control the Type I error at α = 0.01, we adjust the critical values using either theoretical distributions or Monte Carlo simulations.
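
The Pseudocode and Experiment Setup rows describe a Kolmogorov-type GoF test whose critical value is calibrated by Monte Carlo simulation. The following is a minimal sketch of that procedure, not the authors' implementation. It assumes the pivotal statistics are Uniform(0, 1) under the null, so that F0(y) = y, which is the usual setup for the Gumbel-max watermark. The function names (`ks_statistic`, `monte_carlo_critical_value`, `gof_detect`) and the Beta-distributed stand-in for watermarked statistics are hypothetical and introduced only for illustration.

```python
import numpy as np


def ks_statistic(p_values):
    """Two-sided Kolmogorov distance between the p-values and Uniform(0, 1)."""
    p = np.sort(np.asarray(p_values, dtype=float))
    n = p.size
    i = np.arange(1, n + 1)
    # D_n = max_i max{ p_(i) - (i - 1)/n, i/n - p_(i) }
    return float(np.max(np.maximum(p - (i - 1) / n, i / n - p)))


def monte_carlo_critical_value(n, alpha=0.01, n_sim=2000, seed=0):
    """Estimate the (1 - alpha) quantile of D_n under the null by simulation."""
    rng = np.random.default_rng(seed)
    sims = np.array([ks_statistic(rng.uniform(size=n)) for _ in range(n_sim)])
    return float(np.quantile(sims, 1.0 - alpha))


def gof_detect(pivotal_stats, null_cdf, alpha=0.01, n_sim=2000, seed=0):
    """Return (reject_null, D_n, critical_value) for one token sequence."""
    y = np.asarray(pivotal_stats, dtype=float)
    p = 1.0 - null_cdf(y)  # step 2 of Algorithm 1: p_t = 1 - F0(Y_t)
    d_n = ks_statistic(p)  # steps 3-4: sort and compute D_n
    gamma = monte_carlo_critical_value(y.size, alpha, n_sim, seed)  # step 5
    return d_n > gamma, d_n, gamma  # steps 6-10: reject H0 when D_n > gamma


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Assumption: Beta(3, 1) draws stand in for watermarked pivotal statistics,
    # which concentrate near 1; real values would come from the watermark decoder.
    y_watermarked = rng.beta(3.0, 1.0, size=200)
    print(gof_detect(y_watermarked, null_cdf=lambda y: y))
```

The critical value is simulated rather than taken from a limiting distribution, mirroring the note in the Software Dependencies row that simulation is the more flexible route in practice. For this particular statistic, SciPy's `scipy.stats.kstest` computes the same two-sided distance against a given CDF.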