Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Protecting Language Generation Models via Invisible Watermarking
Authors: Xuandong Zhao, Yu-Xiang Wang, Lei Li
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that GINSEW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs. |
| Researcher Affiliation | Academia | Department of Computer Science, UC Santa Barbara. Correspondence to: Xuandong Zhao <EMAIL>, Yu-Xiang Wang <EMAIL>, Lei Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Watermarking process; Algorithm 2 Watermark detection; Algorithm 3 Watermark detection with text alone. |
| Open Source Code | Yes | Our source code is available at https://github.com/XuandongZhao/ginseq. |
| Open Datasets | Yes | In the machine translation task, we utilize the IWSLT14 and WMT14 datasets (Cettolo et al., 2014; Bojar et al., 2014), specifically focusing on German (De) to English (En) translations. For the story generation task, we use the ROCstories (Mostafazadeh et al., 2016) corpus. |
| Dataset Splits | Yes | We adopt the official split of train/valid/test sets. There are 90,000 samples in the train set, and 4081 samples in the validation and test sets. |
| Hardware Specification | Yes | All experiments are conducted on an Amazon EC2 P3 instance equipped with four NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions that the implementation is based on 'fairseq', but does not provide specific version numbers for software dependencies like fairseq itself, Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) with β = (0.9, 0.98) and set the learning rate to 0.0005. Additionally, we incorporate 4,000 warm-up steps. The learning rate then decreases proportionally to the inverse square root of the step number. By default, we use beam search as the decoding method (beam size = 5). |
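
The learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `inverse_sqrt_lr` is hypothetical, and the linear warm-up from zero and the 1-based step count are assumptions, since the quoted text gives only the peak rate (0.0005), the 4,000 warm-up steps, and the inverse-square-root decay.

```python
import math

PEAK_LR = 5e-4        # learning rate quoted in the Experiment Setup row
WARMUP_STEPS = 4000   # warm-up steps quoted in the Experiment Setup row


def inverse_sqrt_lr(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Return the learning rate at a 1-based optimizer step.

    Assumption: the rate rises linearly from zero during warm-up.
    After warm-up it decays proportionally to 1 / sqrt(step), scaled
    so that it equals peak_lr exactly at step == warmup_steps.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step < warmup_steps:
        # Linear warm-up phase (assumed shape).
        return peak_lr * step / warmup_steps
    # Inverse-square-root decay phase.
    return peak_lr * math.sqrt(warmup_steps / step)


if __name__ == "__main__":
    for s in (1, 2000, 4000, 16000, 64000):
        print(s, inverse_sqrt_lr(s))
```

Under these assumptions the rate peaks at step 4,000 and halves each time the step count quadruples. The values could be passed to an Adam optimizer configured with β = (0.9, 0.98), as the row describes.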