Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Practical and Effective Code Watermarking for Large Language Models
Authors: Zhimeng Guo, Minhao Cheng
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that ACW achieves a significant improvement in watermark detection accuracy compared to existing methods, with negligible impact on code functionality. This adaptive framework offers a promising solution for effective and practical code watermarking in the age of LLMs. Our code is available at: https://github.com/TimeLovercc/code-watermark. 5 Experiments Evaluation Overview. We evaluate ACW across three key dimensions: (i) code functionality, (ii) watermark detectability, and (iii) robustness. Our experiments primarily use two strong open-source models OpenCoder-1.5B-Instruct [12] and DeepSeek-Coder-1.3B-instruct and four programming languages. Larger model OpenCoder-8B-Instruct is also evaluated. Evaluation Setup. We evaluate on HumanEval [6] and MBPP [3] benchmarks for Python code generation, and HumanEvalPack [26] for other programming language testing. Experiments on DeepSeek-Coder [10] are in Appendix F.1. Performance is assessed using Pass@k metric [6] for code functionality and AUROC and TPR@5%FPR for watermark detection. We compare against post-hoc detection methods including log p(x), LogRank [8], DetectGPT [25], and GPTZero [29], as well as active watermarking approaches like WLLM [15] and EXP-edit [18] (not green-red watermarking). We also include SWEET [19] as a reference baseline, though it requires access to the original LLM. Implementation Details. For all experiments, we maintain consistent watermarking budget parameters δ = 2.0, γ = 0.5 to ensure fair comparisons. |
| Researcher Affiliation | Academia | Zhimeng Guo The Pennsylvania State University EMAIL Minhao Cheng The Pennsylvania State University EMAIL |
| Pseudocode | Yes | A Watermark Generation and Detection Algorithm of ACW A.1 Watermark Generation Algorithm 1 presents our watermark generation process. A.2 Watermark Detection Algorithm 2 details our statistical watermark detection procedure. |
| Open Source Code | Yes | Our code is available at: https://github.com/TimeLovercc/code-watermark. |
| Open Datasets | Yes | Evaluation Setup. We evaluate on HumanEval [6] and MBPP [3] benchmarks for Python code generation, and HumanEvalPack [26] for other programming language testing. Experiments on DeepSeek-Coder [10] are in Appendix F.1. |
| Dataset Splits | No | The paper mentions standard benchmarks (HumanEval, MBPP, HumanEvalPack) but does not explicitly detail the training/test/validation splits used for these datasets within the paper. It relies on the understanding that these are standard benchmarks which typically come with predefined splits, but explicit percentages or sample counts are not provided by the authors. |
| Hardware Specification | Yes | E.5 Computation Resources The experiments are run on 2 Nvidia A100 GPUs with BF16 precision. |
| Software Dependencies | No | The paper does not explicitly provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA libraries used for implementation, which would be necessary for a fully reproducible software environment. |
| Experiment Setup | Yes | Implementation Details. For all experiments, we maintain consistent watermarking budget parameters δ = 2.0, γ = 0.5 to ensure fair comparisons. Note that these parameters serve as fixed constraints for our model and are not hyperparameters subject to tuning. During code generation, we employ nucleus sampling with parameters p = 0.95 and temperature T = 0.2, following the setup in [19]. E.7 DetectGPT For the DetectGPT implementation, we used T5-3B as our model. Following the original DetectGPT paper and SWEET [25, 19], we set the span length to 2 words and applied masking to 20% of the text. For each test, we generated 100 perturbations. E.8 SWEET Baseline For the SWEET baseline implementation, we conducted a systematic hyperparameter search for the entropy threshold, exploring values from 0.3 to 1.2 with increments of 0.3. Our experiments revealed optimal performance with an entropy threshold of 1.2 for the HumanEval dataset and 0.3 for MBPP. E.9 EXP-edit We evaluate EXP-edit [18] following the methodology of [19]. In our experiments, we set temperature=0.2 and top-p=0.95. We systematically explore block sizes (20 tokens), key sequence lengths (100), resample sizes (50 runs), and edit distance thresholds (γ = 0.0). Through extensive parameter tuning, we determine the optimal configuration: key length 100, block size 20, 50 sampling runs, and detection threshold 0.1, which balances watermark detection reliability with code functionality. |
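
The evaluation metrics quoted in the table (Pass@k, AUROC, and TPR@5%FPR) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' evaluation code: the function names `pass_at_k`, `auroc`, and `tpr_at_fpr` are hypothetical, the Pass@k formula assumes the unbiased estimator commonly used with HumanEval, and the detection scores are assumed to be real numbers where higher means "more likely watermarked".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass the tests."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def auroc(pos: list[float], neg: list[float]) -> float:
    """Probability that a watermarked score beats an unwatermarked one (ties count half)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(pos: list[float], neg: list[float], max_fpr: float = 0.05) -> float:
    """Best true-positive rate among thresholds whose false-positive rate is <= max_fpr."""
    best = 0.0
    # A sample is flagged as watermarked when its score is strictly above the threshold.
    for threshold in sorted(set(pos) | set(neg)):
        fpr = sum(q > threshold for q in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p > threshold for p in pos) / len(pos)
            best = max(best, tpr)
    return best
```

Here `pos` would hold detection scores for watermarked generations and `neg` scores for unwatermarked code. In practice a library routine such as scikit-learn's `roc_curve` would likely replace the quadratic loops; the plain version is kept only to make the definitions explicit.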