A Watermark for Large Language Models

Authors: John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "6. Experiments: In this section we explore the behavior of the watermark using the OPT-1.3B model (Zhang et al., 2022). We measure watermark strength using the rate of type-I errors (human text falsely flagged as watermarked) and type-II errors (watermarked text not detected)."
Researcher Affiliation | Academia | "John Kirchenbauer*, Jonas Geiping*, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein ... University of Maryland."
Pseudocode | Yes (see the sketch below) | "Algorithm 1 Text Generation with Hard Red List ... Algorithm 2 Text Generation with Soft Red List"
Open Source Code | Yes | "Code is available at https://www.github.com/jwkirchenbauer/lm-watermarking."
Open Datasets | Yes | "We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family ... slice and dice a random selection of texts from the news-like subset of the C4 dataset (Raffel et al., 2019)."
Dataset Splits | No | The paper evaluates the watermark on LLM-generated text. While it draws prompts from the C4 dataset and uses a validation set for one specific experiment (TriviaQA), it does not provide traditional train/validation/test splits for the watermarking evaluation, so the data partitioning of the main experiments cannot be reproduced from the paper alone.
Hardware Specification | No | The paper states that the watermark is implemented with PyTorch and Huggingface and lists the models used, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run the experiments.
Software Dependencies | No | "We implement the proposed watermark using the Pytorch backend of the Huggingface library (Wolf et al., 2020)."
Experiment Setup | Yes (see the generation sketch below) | "Watermark parameters are γ, δ = (0.25, 2). ... We compute results using 500 ± 10 sequences of length T = 200 ± 5 tokens for each parameter choice. ... When a multinomial sampler is used (which is assumed by Theorem 4.2), we use the softmax output with standard temperature hyperparameter temp=0.7."
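
The Pseudocode row refers to Algorithms 1 and 2 of the paper (hard and soft red list generation). The sketch below is a minimal, illustrative rendering of the soft red list step only: the function name, the use of the previous token id as the RNG seed, and the permutation-based partition are my own simplifications of the paper's description, with γ and δ taken from the Experiment Setup row; the authors' actual implementation is in the linked repository.

```python
import torch

def soft_redlist_logits(logits, prev_token, gamma=0.25, delta=2.0):
    """Bias next-token logits toward a pseudorandom green list (sketch of Algorithm 2).

    logits     : 1-D tensor of raw logits over the vocabulary
    prev_token : id of the previously generated token; it seeds the green/red
                 partition so a detector can recompute the same lists
    gamma      : fraction of the vocabulary placed on the green list
    delta      : bias added to green-list logits before the softmax
    """
    vocab_size = logits.shape[-1]
    # Seed a generator from the previous token id (the paper hashes the previous
    # token; the exact hash function is an implementation detail).
    gen = torch.Generator()
    gen.manual_seed(int(prev_token) % (2**31 - 1))
    # Randomly permute the vocabulary; the first gamma-fraction is the green list.
    perm = torch.randperm(vocab_size, generator=gen)
    green = perm[: int(gamma * vocab_size)].to(logits.device)
    # Soft rule: add delta to green-list logits, leave red-list logits untouched.
    biased = logits.clone()
    biased[green] += delta
    return biased
```

The hard red list of Algorithm 1 can be read as the limiting case of this rule in which red-list tokens are forbidden outright rather than merely down-weighted.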
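To connect this to the sampling configuration quoted in the Experiment Setup row (multinomial sampling, temp=0.7, T ≈ 200 tokens), the following sketch wires the function above into the Hugging Face `transformers` generation API. The `SoftRedListProcessor` wrapper and the prompt are my own scaffolding, not the class names used in the authors' repository.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class SoftRedListProcessor(LogitsProcessor):
    """Applies the soft red list bias at every decoding step."""

    def __call__(self, input_ids, scores):
        # input_ids: (batch, seq_len) tokens so far; scores: (batch, vocab) logits.
        for i in range(scores.shape[0]):
            scores[i] = soft_redlist_logits(scores[i], input_ids[i, -1])
        return scores

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

inputs = tokenizer("The watermark works by", return_tensors="pt")
output = model.generate(
    **inputs,
    logits_processor=LogitsProcessorList([SoftRedListProcessor()]),
    do_sample=True,      # multinomial sampling, as assumed by Theorem 4.2
    temperature=0.7,     # temperature reported in the Experiment Setup row
    max_new_tokens=200,  # roughly the T = 200-token sequences used in the paper
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```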
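The Research Type row quotes the paper's use of type-I and type-II error rates to measure watermark strength. Detection in the paper is a one-proportion z-test on the number of green-list tokens in a passage; the sketch below writes out that statistic, with the counts and the z = 4 decision threshold chosen purely for illustration.

```python
import math

def detection_z_score(num_green, T, gamma=0.25):
    """One-proportion z-test on the count of green-list tokens.

    Under the null hypothesis (text written with no knowledge of the red
    list), each of the T tokens lands on the green list with probability
    gamma, so the green count concentrates around gamma * T.
    """
    return (num_green - gamma * T) / math.sqrt(T * gamma * (1 - gamma))

# Text is declared watermarked when z exceeds a chosen threshold (z = 4 here,
# as an example). The threshold sets the type-I error rate (human text falsely
# flagged); the type-II error rate is the fraction of watermarked text whose
# z-score falls below it.
z = detection_z_score(num_green=110, T=200)  # hypothetical counts
print(z, z > 4.0)
```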