No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Authors: Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith

NeurIPS 2024

Reproducibility variables, results, and supporting LLM responses:

Research Type: Experimental
LLM Response: "Throughout, we explore our attacks on three state-of-the-art watermarks [16, 45, 19] and two LLMs (LLAMA-2-7B [37] and OPT-1.3B [44]), demonstrating that these vulnerabilities are common to existing LLM watermarks and providing caution for the field in deploying current solutions in practice without carefully considering the impact and trade-offs of watermarking design choices. Our code is available at https://github.com/Qi-Pang/LLM-Watermark-Attacks."

Researcher Affiliation: Academia
LLM Response: "Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith; Carnegie Mellon University; {qipang, shengyuanhu, wenting, smithv}@cmu.edu"

Pseudocode: No
LLM Response: The paper describes attack procedures and watermarking schemes with mathematical formulas but does not include explicit pseudocode or algorithm blocks.

Open Source Code: Yes
LLM Response: "Our code is available at https://github.com/Qi-Pang/LLM-Watermark-Attacks."

Open Datasets: Yes
LLM Response: "Unless otherwise specified, in the evaluations of this work, we utilize 500 prompts from the OpenGen [18] dataset and query the watermarked language models (LLAMA-2-7B [37] and OPT-1.3B [44]) to generate the watermarked outputs."

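For orientation, the following is a minimal sketch of this generation loop, assuming the OpenGen prompts sit in a local JSON list of strings (the file name opengen_prompts.json and the 500-prompt slice are illustrative, and the watermarking logits processors from the authors' repository are omitted):

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "meta-llama/Llama-2-7b-hf" and "facebook/opt-1.3b" are the two models
# evaluated in the paper; pick either.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

# Assumption: the prompts are stored as a JSON list of strings.
with open("opengen_prompts.json") as f:
    prompts = json.load(f)[:500]  # 500 prompts, as in the paper

watermarked_outputs = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # The watermark's LogitsProcessor (KGW / Unigram / Exp) would be passed
    # via `logits_processor=` here; see the authors' repository.
    generated = model.generate(**inputs, max_new_tokens=200, do_sample=True)
    completion = generated[0, inputs["input_ids"].shape[1]:]
    watermarked_outputs.append(
        tokenizer.decode(completion, skip_special_tokens=True)
    )
```
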
Dataset Splits: No
LLM Response: The paper mentions using the OpenGen dataset but does not specify explicit training, validation, and test splits (e.g., percentages or counts for each).

Hardware Specification: Yes
LLM Response: "We conduct the experiments on a cluster with 8 NVIDIA A100 GPUs, AMD EPYC 7763 64-Core CPU, and 1TB memory."

Software Dependencies: No
LLM Response: The paper mentions models like LLAMA-2-7B and OPT-1.3B, and uses GPT-4 for editing, but does not list specific software dependencies with version numbers for the experimental environment (e.g., Python, PyTorch, CUDA versions).

Experiment Setup: Yes
LLM Response: "Specifically, for the toxic token insertion, we generate a list of 200 toxic tokens and insert them at random positions in the watermarked output. For the fluent inaccurate editing, we edit the watermarked sentence by querying GPT-4 with the prompt 'Modify less than 3 words in the following sentence and make it inaccurate or have opposite meanings.' Unless otherwise specified, in the evaluations of this work, we utilize 500 prompts from the OpenGen [18] dataset and query the watermarked language models (LLAMA-2-7B [37] and OPT-1.3B [44]) to generate the watermarked outputs. We evaluate three SOTA watermarks, including KGW [16], Unigram [45], and Exp [19], using the default watermarking hyperparameters. In our experiments, we default to a maximum of 200 new tokens for KGW and Unigram, and 70 for Exp, due to its complexity in watermark detection; 70 is also the maximum number of tokens the authors of Exp evaluated in their paper [19]. For the KGW [16] and Unigram [45] watermarks, we use the default parameters in [45], where the watermark strength is δ = 2 and the green-list portion is γ = 0.5. We employ a detection threshold of T = 4 for these two watermarks with a single watermark key. For scenarios where multiple keys are used, we calculate the thresholds to guarantee that the false positive rates (FPRs) are below 1e-3. For the Exp watermark (referred to as Exp-edit in [19]), we use the default parameters, where the watermark key length is n = 256 and the block size k defaults to the token length. We set the p-value threshold for Exp to 0.05 in our experiments."
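
To make the detection side of these settings concrete, below is a minimal sketch of the toxic-token insertion step and the one-proportion z-test that underlies KGW/Unigram-style detection with the stated defaults (γ = 0.5, threshold T = 4). The union-bound calibration in multi_key_threshold is an assumption about how per-key thresholds could be set to keep the overall FPR below 1e-3; the authors' exact calibration may differ, and all helper names here are hypothetical.

```python
import math
import random
from statistics import NormalDist


def insert_toxic_tokens(watermarked_tokens, toxic_tokens):
    """Attack sketch: splice toxic tokens into a watermarked output at
    random positions so the still-detectable watermark gets attributed
    to harmful text (e.g., a list of 200 toxic tokens, as above)."""
    out = list(watermarked_tokens)
    for tok in toxic_tokens:
        out.insert(random.randrange(len(out) + 1), tok)
    return out


def kgw_z_score(num_green, num_tokens, gamma=0.5):
    """One-proportion z-test: under the null (unwatermarked text), each
    token falls in the green list with probability gamma."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std


def is_watermarked(num_green, num_tokens, gamma=0.5, threshold=4.0):
    """Single-key detection with the paper's default threshold T = 4."""
    return kgw_z_score(num_green, num_tokens, gamma) > threshold


def multi_key_threshold(num_keys, total_fpr=1e-3):
    """Assumed union-bound calibration: split the FPR budget evenly across
    keys and convert the per-key tail probability to a z threshold."""
    return NormalDist().inv_cdf(1 - total_fpr / num_keys)


# 140 green tokens out of 200 gives z ≈ 5.66 > 4, so the text is flagged.
print(is_watermarked(140, 200))          # True
print(round(multi_key_threshold(3), 2))  # per-key z threshold for 3 keys
```

Under the normal approximation, the single-key threshold of 4 corresponds to a false positive rate of roughly 3e-5, comfortably below the 1e-3 budget the paper targets for the multi-key setting.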