Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Theoretically Grounded Framework for LLM Watermarking: A Distribution-Adaptive Approach
Authors: Haiyun He, Yepeng Liu, Ziqiao Wang, Yongyi Mao, Yuheng Bu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Llama2-13B and Mistral-8×7B models confirm the effectiveness of our approach, particularly at ultra-low FPRs. |
| Researcher Affiliation | Academia | Haiyun He HKUST (GZ) Guangzhou, China EMAIL Yepeng Liu UC Santa Barbara Santa Barbara, CA, USA EMAIL Ziqiao Wang Tongji University Shanghai, China EMAIL Yongyi Mao University of Ottawa Ottawa, ON, Canada EMAIL Yuheng Bu UC Santa Barbara Santa Barbara, CA, USA EMAIL |
| Pseudocode | Yes | Algorithm 1 Watermarked Text Generation Algorithm 2 Watermarked Text Detection |
| Open Source Code | Yes | Our code is available at https://github.com/yepengliu/DAWA. |
| Open Datasets | Yes | Our experiments are conducted using two distinct datasets. The first is an open-ended high-entropy generation dataset, a realnewslike subset from C4 [49]. The second is a relatively low-entropy generation dataset, ELI5 [50]. |
| Dataset Splits | Yes | We conduct experiments using Llama2-13B on 105 texts (200-length) from the Wikipedia dataset and compute the TPR at 1e-01, 1e-02, 1e-03, 1e-04, and 1e-05 FPR respectively. ... We utilize the first two sentences of each text as prompts and the following 200 tokens as human-generated text. |
| Hardware Specification | Yes | We conduct our experiments on Nvidia A100 GPUs. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | In DAWA, we set η = 0.2 and T = 200. ... KGW+23 employs the prior 1 token as a hash to create a green/red list, with the watermark strength set at 2. |
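
The Dataset Splits entry reports TPR at fixed FPR levels. As a rough illustration of what that metric means, here is a minimal sketch of an empirical TPR-at-FPR computation. It is not the paper's detector: the function name `tpr_at_fpr`, the convention that a higher score means "more likely watermarked", and the toy scores are all assumptions made for this example.

```python
import math
from typing import Sequence, Tuple


def tpr_at_fpr(
    human_scores: Sequence[float],
    watermarked_scores: Sequence[float],
    target_fpr: float,
) -> Tuple[float, float]:
    """Return (TPR, threshold) for a detector that flags text when score > threshold.

    Hypothetical helper: the threshold is chosen on human-written text so that
    the empirical FPR does not exceed target_fpr.
    """
    if not human_scores or not watermarked_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must lie in [0, 1]")

    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we may flag; the small epsilon guards against
    # floating-point products such as 0.29 * 100 = 28.999999999999996.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    # At most `allowed` human scores are strictly greater than ranked[allowed].
    threshold = ranked[allowed] if allowed < len(ranked) else -math.inf

    flagged = sum(score > threshold for score in watermarked_scores)
    return flagged / len(watermarked_scores), threshold


# Toy scores, invented for illustration only.
human = [float(i) for i in range(10)]
watermarked = [8.5, 12.0, 3.0, 10.0]
tpr, thr = tpr_at_fpr(human, watermarked, target_fpr=0.1)
print(f"TPR={tpr:.2f} at threshold {thr}")
```

Under this scheme, resolving an FPR of 1e-05 needs at least 100,000 human-written scores; with fewer, `allowed` rounds down to zero and the threshold falls back to the largest human score.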