WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off
Authors: Eva Giboulot, Teddy Furon
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Its performance is both theoretically proven and experimentally validated. It outperforms all the SotA techniques under the most complete benchmark suite. ... We perform our analysis with three different LLMs ... We evaluate the performance of WaterMax using the Mark My Words benchmark [29] under a variety of attacks. |
| Researcher Affiliation | Academia | Eva Giboulot Inria, CNRS, IRISA University of Rennes Rennes, France eva.giboulot@inria.fr Teddy Furon Inria, CNRS, IRISA University of Rennes Rennes, France teddy.furon@inria.fr |
| Pseudocode | Yes | The pseudo-code is summarized in Alg. 1 in App. F. ... Algorithm 1 Iterative watermarking of generated texts |
| Open Source Code | Yes | A full repository of the code used for the experiments is provided as well as the exact commands to replicate them. |
| Open Datasets | Yes | The evaluation is performed on the three long-form creative writing tasks of the Mark My Words benchmark [29]: news article generation, summarization of existing books, and writing of an invented story. This leads to the generation of 296 texts. ... For the attack suite, we use the implementation of the Mark My Words benchmark found at https://github.com/wagner-group/MarkMyWords. |
| Dataset Splits | No | The paper describes the evaluation datasets drawn from the Mark My Words benchmark and the overall setup, but it does not explicitly state training/validation/test splits (e.g., percentages or sample counts); it relies implicitly on the benchmark's predefined tasks. |
| Hardware Specification | Yes | (left) Total generation time in seconds of WaterMax as a function of the number of chunks N and the number of drafts/chunk n for texts of 256 tokens using sampling on a Nvidia A100 with Llama-2-chat-hf. |
| Software Dependencies | No | The paper names the specific LLMs used (e.g., 'Llama3-8b-Instruct', 'Llama-2-7b-chat', 'Phi-3-mini-4k-Instruct') but does not specify software dependencies such as a programming language version (e.g., Python 3.x), framework versions (e.g., PyTorch, TensorFlow), or a CUDA version. |
| Experiment Setup | Yes | We fix the maximum text size L to 256 tokens for all tasks (see App. K for larger text sizes). The length of a chunk ℓ is fixed a priori to L/N where L is the maximum number of tokens allowed in the benchmark and N the number of chunks. ... All the evaluations in this section use the model Llama3-8b-Instruct [32, 3] ... Its temperature θ varies... watermarking scheme is not allowed to modify θ compared to the original LLM. ... For KGW and Aaronson's scheme, we use the implementation provided by [11] ... the watermark schemes all use a window size of h = 6 for hashing. ... We choose the setting (N, n) = (16, 10)... As for KGW [21], we fix δ = 2.0... We set its green-list ratio γ = 0.5... we select b = 4 where a small loss of quality is acceptable and b = 6 if a virtually lossless scheme is warranted. |
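To make the quoted setup concrete, here is a minimal sketch of the iterative, best-of-n chunked generation that the paper's Algorithm 1 describes, using the reported settings L = 256, (N, n) = (16, 10), and a hash window of h = 6. Everything below other than those parameters is an assumption for illustration: `generate_chunk` is a random stand-in for LLM sampling, and `seeded_score` is a toy keyed-hash "green token" score (in the spirit of KGW with γ = 0.5), not the paper's actual detector.

```python
import hashlib
import random

def seeded_score(tokens, key=42, window=6):
    """Toy watermark score (assumption, not the paper's detector): hash each
    h=6-token context plus the current token with a secret key, and return the
    fraction of tokens whose hash lands in the 'green' half (gamma = 0.5)."""
    hits, total = 0, 0
    for i in range(window, len(tokens)):
        ctx = ",".join(map(str, tokens[i - window:i])) + f"|{key}|{tokens[i]}"
        digest = int(hashlib.sha256(ctx.encode()).hexdigest(), 16)
        hits += digest % 2  # uniform bit: token is 'green' w.p. 1/2 for unrelated text
        total += 1
    return hits / max(total, 1)

def generate_chunk(rng, length):
    """Hypothetical stand-in for sampling a chunk from the LLM: random token ids."""
    return [rng.randrange(1000) for _ in range(length)]

def watermax_generate(L=256, N=16, n=10, key=42, seed=0):
    """Iterative watermarking sketch: build the text chunk by chunk (chunk
    length L/N); at each step sample n drafts and keep the draft that
    maximizes the watermark score of the text so far."""
    rng = random.Random(seed)
    text = []
    for _ in range(N):
        drafts = [generate_chunk(rng, L // N) for _ in range(n)]
        best = max(drafts, key=lambda d: seeded_score(text + d, key))
        text += best
    return text
```

Because only whole drafts are re-sampled and selected, the LLM's per-token distribution is untouched (matching the constraint that the scheme may not modify the temperature θ), while the selected text scores well above the ≈0.5 baseline of unwatermarked text under the keyed detector.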