WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off
Authors: Eva Giboulot, Teddy Furon
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Its performance is both theoretically proven and experimentally validated. It outperforms all the SotA techniques under the most complete benchmark suite. ... We perform our analysis with three different LLMs ... We evaluate the performance of WaterMax using the Mark My Words benchmark [29] under a variety of attacks. |
| Researcher Affiliation | Academia | Eva Giboulot Inria, CNRS, IRISA University of Rennes Rennes, France eva.giboulot@inria.fr Teddy Furon Inria, CNRS, IRISA University of Rennes Rennes, France teddy.furon@inria.fr |
| Pseudocode | Yes | The pseudo-code is summarized in Alg. 1 in App. F. ... Algorithm 1 Iterative watermarking of generated texts |
| Open Source Code | Yes | A full repository of the code used for the experiments is provided as well as the exact commands to replicate them. |
| Open Datasets | Yes | The evaluation is performed on the three long-form creative writing tasks of the Mark My Words benchmark [29]: news article generation, summarization of existing books, and writing of an invented story. This leads to the generation of 296 texts. ... For the attack suite, we use the implementation of the Mark My Words benchmark found at https://github.com/wagner-group/MarkMyWords. |
| Dataset Splits | No | The paper describes the evaluation datasets drawn from the Mark My Words benchmark and the overall setup, but it does not explicitly state training/validation/test splits (e.g., percentages or sample counts); it relies implicitly on the benchmark's predefined tasks. |
| Hardware Specification | Yes | (left) Total generation time in seconds of WaterMax as a function of the number of chunks N and the number of drafts/chunk n for texts of 256 tokens using sampling on a Nvidia A100 with Llama-2-chat-hf. |
| Software Dependencies | No | The paper names the specific LLMs used (e.g., 'Llama3-8b-Instruct', 'Llama-2-7b-chat', 'Phi-3-mini-4k-Instruct') but does not specify software dependencies such as a programming language version (e.g., Python 3.x), framework versions (e.g., PyTorch, TensorFlow), or a CUDA version. |
| Experiment Setup | Yes | We fix the maximum text size L to 256 tokens for all tasks (see App. K for larger text sizes). The length of a chunk ℓ is fixed a priori to L/N where L is the maximum number of tokens allowed in the benchmark and N the number of chunks. ... All the evaluations in this section use the model Llama3-8b-Instruct [32, 3] ... Its temperature θ varies... watermarking scheme is not allowed to modify θ compared to the original LLM. ... For KGW and Aaronson's scheme, we use the implementation provided by [11] ... the watermark schemes all use a window size of h = 6 for hashing. ... We choose the setting (N, n) = (16, 10)... As for KGW [21], we fix δ = 2.0... We set its green-list ratio γ = 0.5... we select b = 4 where a small loss of quality is acceptable and b = 6 if a virtually lossless scheme is warranted. |
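To make the quoted setup concrete, here is a minimal sketch of the iterative, best-of-n chunked generation that the paper's Algorithm 1 describes, using the reported settings L = 256, (N, n) = (16, 10), and a hash window of h = 6. Everything below other than those parameters is an assumption for illustration: `generate_chunk` is a random stand-in for LLM sampling, and `seeded_score` is a toy keyed-hash "green token" score (in the spirit of KGW with γ = 0.5), not the paper's actual detector.

```python
import hashlib
import random

def seeded_score(tokens, key=42, window=6):
    """Toy watermark score (assumption, not the paper's detector): hash each
    h=6-token context plus the current token with a secret key, and return the
    fraction of tokens whose hash lands in the 'green' half (gamma = 0.5)."""
    hits, total = 0, 0
    for i in range(window, len(tokens)):
        ctx = ",".join(map(str, tokens[i - window:i])) + f"|{key}|{tokens[i]}"
        digest = int(hashlib.sha256(ctx.encode()).hexdigest(), 16)
        hits += digest % 2  # uniform bit: token is 'green' w.p. 1/2 for unrelated text
        total += 1
    return hits / max(total, 1)

def generate_chunk(rng, length):
    """Hypothetical stand-in for sampling a chunk from the LLM: random token ids."""
    return [rng.randrange(1000) for _ in range(length)]

def watermax_generate(L=256, N=16, n=10, key=42, seed=0):
    """Iterative watermarking sketch: build the text chunk by chunk (chunk
    length L/N); at each step sample n drafts and keep the draft that
    maximizes the watermark score of the text so far."""
    rng = random.Random(seed)
    text = []
    for _ in range(N):
        drafts = [generate_chunk(rng, L // N) for _ in range(n)]
        best = max(drafts, key=lambda d: seeded_score(text + d, key))
        text += best
    return text
```

Because only whole drafts are re-sampled and selected, the LLM's per-token distribution is untouched (matching the constraint that the scheme may not modify the temperature θ), while the selected text scores well above the ≈0.5 baseline of unwatermarked text under the keyed detector.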