Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking

Authors: Karanpartap Singh, James Zou

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments, conducted across various datasets, reveal that current watermarking methods are moderately detectable by even simple classifiers, challenging the notion of watermarking subtlety. We tested three datasets in this study. For all datasets, a section of text up to 50 words long was spliced from each sample, after which the 7-billion-parameter variant of the Llama-2 model was tasked with completing the output (Touvron et al., 2023), both with and without a watermarking layer applied.
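The prompt-splicing step described above can be sketched as follows. This is a minimal illustration, not the paper's released code; `splice_prompt` is a hypothetical helper name.

```python
def splice_prompt(sample: str, max_words: int = 50) -> str:
    """Take a prefix of up to `max_words` words from a document,
    to serve as the completion prompt for the language model."""
    words = sample.split()
    return " ".join(words[:max_words])

# A short document passes through whole; a long one is truncated to 50 words.
short_prompt = splice_prompt("A brief sample.")
long_prompt = splice_prompt("word " * 120)
assert short_prompt == "A brief sample."
assert len(long_prompt.split()) == 50
```

In the study, each spliced prefix would then be completed twice by Llama-2 7B: once with the watermarking layer applied and once without.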
Researcher Affiliation Academia Karanpartap Singh (EMAIL), Department of Electrical Engineering, Stanford University; James Zou (EMAIL), Department of Biomedical Data Science, Stanford University
Pseudocode No The paper describes methods in prose and does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes Code Availability: The source code for all experiments is available at https://github.com/su-karanps/watermark_eval.
Open Datasets Yes We tested three datasets in this study. 1. LongForm, Validation Set: we used the Wikipedia subset of the LongForm dataset's validation set, consisting of 251 human-written documents on various topics (Köksal et al., 2023). 2. C4 RealNewsLike, Validation Set: a subset of the Common Crawl web crawl corpus, the RealNewsLike dataset contains text extracted from online news articles (Raffel et al., 2019). We used 500 samples from this dataset. 3. Scientific Papers, Test Set: a collection of long, structured documents from the arXiv and PubMed open access article repositories (Cohan et al., 2018). We used the abstracts from 252 samples.
Dataset Splits Yes This classifier achieved an accuracy just above random guessing, at approximately 56%, across various datasets using k-fold cross-validation with 5 folds. When evaluating all of the datasets together, k-fold cross-validation was used with 5 folds. For the three individual datasets, each algorithm was trained on the indicated dataset and tested for generalizability on the other two datasets.
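A minimal sketch of the 5-fold partitioning described above, in pure Python with the classifier itself abstracted away. The function name and fold layout are illustrative assumptions, not taken from the paper's code.

```python
def kfold_indices(n_samples: int, k: int = 5):
    """Partition sample indices into k contiguous folds; each fold serves
    once as the test set while the remaining folds form the training set."""
    # Distribute any remainder across the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Every sample appears in exactly one test fold across the 5 splits.
seen = [idx for _, test in kfold_indices(10, 5) for idx in test]
assert sorted(seen) == list(range(10))
```

In practice a library implementation (e.g. scikit-learn's `KFold`) would typically be used, with optional shuffling before splitting.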
Hardware Specification Yes We note that, on the same hardware (1x NVIDIA Tesla GPU), this watermark took between 15-20x the computation time for each generation as compared to the original model or soft-watermark.
Software Dependencies No The paper mentions the use of 'Adam optimizer' and 'torch.optim.lr_scheduler.ReduceLROnPlateau' but does not provide specific version numbers for these software components or the underlying programming language/frameworks.
Experiment Setup Yes Optimization was performed using the Adam optimizer with β1 = 0.5, β2 = 0.999, and learning rate and weight decay determined through a hyperparameter grid search (parameters shown in Table 7). A dynamic learning rate scheduler was used (torch.optim.lr_scheduler.ReduceLROnPlateau) with a factor of 0.5 and a patience of 50 epochs. Training was conducted for 150 epochs. Table 7: Binary classifier hyperparameters — Adam Weight Decay: grid search over {2e-4, 2e-3, 2e-2}; Learning Rate: grid search over {2e-5, 2e-4, 2e-3}; Batch Size: grid search over {50, 75, 100}; Dataset Randomization: grid search over {On, Off}
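The grid in Table 7 spans 3 × 3 × 3 × 2 = 54 configurations. A sketch of how such a grid can be enumerated; the dictionary keys and function name are illustrative, since the paper's text does not show its search loop.

```python
from itertools import product

# Hyperparameter grids from Table 7 (binary classifier).
grid = {
    "weight_decay": [2e-4, 2e-3, 2e-2],
    "learning_rate": [2e-5, 2e-4, 2e-3],
    "batch_size": [50, 75, 100],
    "dataset_randomization": [True, False],
}

def grid_configs(grid: dict):
    """Yield one config dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(grid))
assert len(configs) == 3 * 3 * 3 * 2  # 54 configurations to train and compare
```

Each configuration would then be trained for 150 epochs with Adam (β1 = 0.5, β2 = 0.999) and a `ReduceLROnPlateau` scheduler, keeping the best-performing setting.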