No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Authors: Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith

NeurIPS 2024

Reproducibility variables, results, and supporting LLM responses:

Research Type: Experimental
LLM Response: "Throughout, we explore our attacks on three state-of-the-art watermarks [16, 45, 19] and two LLMs (LLAMA-2-7B [37] and OPT-1.3B [44]), demonstrating that these vulnerabilities are common to existing LLM watermarks and providing caution for the field in deploying current solutions in practice without carefully considering the impact and trade-offs of watermarking design choices. Our code is available at https://github.com/Qi-Pang/LLM-Watermark-Attacks."

Researcher Affiliation: Academia
LLM Response: "Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith; Carnegie Mellon University; {qipang, shengyuanhu, wenting, smithv}@cmu.edu"

Pseudocode: No
LLM Response: The paper describes attack procedures and watermarking schemes with mathematical formulas but does not include explicit pseudocode or algorithm blocks.

Open Source Code: Yes
LLM Response: "Our code is available at https://github.com/Qi-Pang/LLM-Watermark-Attacks."

Open Datasets: Yes
LLM Response: "Unless otherwise specified, in the evaluations of this work, we utilize 500 prompts from the OpenGen [18] dataset and query the watermarked language models (LLAMA-2-7B [37] and OPT-1.3B [44]) to generate the watermarked outputs."

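For orientation, the following is a minimal sketch of this generation loop, assuming the OpenGen prompts sit in a local JSON list of strings (the file name opengen_prompts.json and the 500-prompt slice are illustrative, and the watermarking logits processors from the authors' repository are omitted):

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "meta-llama/Llama-2-7b-hf" and "facebook/opt-1.3b" are the two models
# evaluated in the paper; pick either.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

# Assumption: the prompts are stored as a JSON list of strings.
with open("opengen_prompts.json") as f:
    prompts = json.load(f)[:500]  # 500 prompts, as in the paper

watermarked_outputs = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # The watermark's LogitsProcessor (KGW / Unigram / Exp) would be passed
    # via `logits_processor=` here; see the authors' repository.
    generated = model.generate(**inputs, max_new_tokens=200, do_sample=True)
    completion = generated[0, inputs["input_ids"].shape[1]:]
    watermarked_outputs.append(
        tokenizer.decode(completion, skip_special_tokens=True)
    )
```
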
Dataset Splits: No
LLM Response: The paper mentions using the OpenGen dataset but does not specify explicit training, validation, and test splits (e.g., percentages or counts for each).

Hardware Specification: Yes
LLM Response: "We conduct the experiments on a cluster with 8 NVIDIA A100 GPUs, AMD EPYC 7763 64-Core CPU, and 1TB memory."

Software Dependencies: No
LLM Response: The paper mentions models like LLAMA-2-7B and OPT-1.3B, and uses GPT-4 for editing, but does not list specific software dependencies with version numbers for the experimental environment (e.g., Python, PyTorch, CUDA versions).

Experiment Setup: Yes
LLM Response: "Specifically, for the toxic token insertion, we generate a list of 200 toxic tokens and insert them at random positions in the watermarked output. For the fluent inaccurate editing, we edit the watermarked sentence by querying GPT-4 with the prompt 'Modify less than 3 words in the following sentence and make it inaccurate or have opposite meanings.' Unless otherwise specified, in the evaluations of this work, we utilize 500 prompts from the OpenGen [18] dataset and query the watermarked language models (LLAMA-2-7B [37] and OPT-1.3B [44]) to generate the watermarked outputs. We evaluate three SOTA watermarks, including KGW [16], Unigram [45], and Exp [19], using the default watermarking hyperparameters. In our experiments, we default to a maximum of 200 new tokens for KGW and Unigram, and 70 for Exp, due to its complexity in watermark detection; 70 is also the maximum number of tokens the authors of Exp evaluated in their paper [19]. For the KGW [16] and Unigram [45] watermarks, we use the default parameters in [45], where the watermark strength is δ = 2 and the green-list portion is γ = 0.5. We employ a detection threshold of T = 4 for these two watermarks with a single watermark key. For scenarios where multiple keys are used, we calculate the thresholds to guarantee that the false positive rates (FPRs) are below 1e-3. For the Exp watermark (referred to as Exp-edit in [19]), we use the default parameters, where the watermark key length is n = 256 and the block size k defaults to the token length. We set the p-value threshold for Exp to 0.05 in our experiments."
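
To make the detection side of these settings concrete, below is a minimal sketch of the toxic-token insertion step and the one-proportion z-test that underlies KGW/Unigram-style detection with the stated defaults (γ = 0.5, threshold T = 4). The union-bound calibration in multi_key_threshold is an assumption about how per-key thresholds could be set to keep the overall FPR below 1e-3; the authors' exact calibration may differ, and all helper names here are hypothetical.

```python
import math
import random
from statistics import NormalDist


def insert_toxic_tokens(watermarked_tokens, toxic_tokens):
    """Attack sketch: splice toxic tokens into a watermarked output at
    random positions so the still-detectable watermark gets attributed
    to harmful text (e.g., a list of 200 toxic tokens, as above)."""
    out = list(watermarked_tokens)
    for tok in toxic_tokens:
        out.insert(random.randrange(len(out) + 1), tok)
    return out


def kgw_z_score(num_green, num_tokens, gamma=0.5):
    """One-proportion z-test: under the null (unwatermarked text), each
    token falls in the green list with probability gamma."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std


def is_watermarked(num_green, num_tokens, gamma=0.5, threshold=4.0):
    """Single-key detection with the paper's default threshold T = 4."""
    return kgw_z_score(num_green, num_tokens, gamma) > threshold


def multi_key_threshold(num_keys, total_fpr=1e-3):
    """Assumed union-bound calibration: split the FPR budget evenly across
    keys and convert the per-key tail probability to a z threshold."""
    return NormalDist().inv_cdf(1 - total_fpr / num_keys)


# 140 green tokens out of 200 gives z ≈ 5.66 > 4, so the text is flagged.
print(is_watermarked(140, 200))          # True
print(round(multi_key_threshold(3), 2))  # per-key z threshold for 3 keys
```

Under the normal approximation, the single-key threshold of 4 corresponds to a false positive rate of roughly 3e-5, comfortably below the 1e-3 budget the paper targets for the multi-key setting.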