A Semantic Invariant Robust Watermark for Large Language Models

Authors: Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, Lijie Wen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiments, we evaluate the attack robustness of our watermarking algorithm against various semantics-preserving perturbations, including text paraphrasing and synonym replacement. Overall, the robustness of our watermark is comparable to that of KGW-1 (global watermark logits), which is close to the robustness upper bound achievable by watermark-logit-based methods. Additionally, employing the spoofing-attack paradigm of Sadasivan et al. (2023), we evaluate the decryption accuracy of various watermarking methods to gauge security robustness. Our algorithm demonstrates favorable security robustness, effectively resolving the trade-off between attack robustness and security robustness encountered by previous methods.
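
To make the perturbation setting concrete, the following is a minimal sketch of a synonym-replacement attack of the kind evaluated above. It assumes NLTK with the WordNet corpus installed (nltk.download('wordnet')); the replacement ratio, whitespace tokenization, and seed are illustrative choices, not the paper's exact configuration.

# Minimal sketch of a synonym-replacement attack used to probe
# watermark robustness. Assumes NLTK's WordNet corpus is available;
# ratio/tokenization are illustrative, not the paper's exact setup.
import random
from nltk.corpus import wordnet

def synonym_attack(text: str, ratio: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = text.split()
    # Candidate positions: words with at least one WordNet synset.
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    rng.shuffle(candidates)
    for i in candidates[: int(len(words) * ratio)]:
        # Collect synonyms across all synsets, excluding the word itself.
        lemmas = {
            l.name().replace("_", " ")
            for s in wordnet.synsets(words[i])
            for l in s.lemmas()
        } - {words[i]}
        if lemmas:
            words[i] = rng.choice(sorted(lemmas))
    return " ".join(words)
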
Researcher Affiliation | Academia | Aiwei Liu¹, Leyi Pan¹, Xuming Hu², Shiao Meng¹, Lijie Wen¹. ¹School of Software, BNRist, Tsinghua University; ²The Hong Kong University of Science and Technology (Guangzhou). liuaw20@mails.tsinghua.edu.cn, wenlj@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: Watermark Generation
1: Input: watermark strength δ, a language model M, previously generated text t = [t_0, ..., t_{l-1}], a text embedding language model E, and a trained watermark model T.
2: Generate the next-token logits from the language model: P_M(x_prompt, t_{:l-1}).
3: Generate the sentence embedding e_l = E(t_{:l-1}).
4: Generate the watermark logits P_W from the trained watermark model: P_W = T(e_l).
5: Define a new language model M̂ such that, given input t = [t_0, ..., t_{l-1}], the resulting logits satisfy P_M̂(x_prompt, t_{:l-1}) = P_M(x_prompt, t_{:l-1}) + δ · P_W(x_prompt, t_{:l-1}).
6: Output: the watermarked next-token logits P_M̂(t_l).
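
A minimal PyTorch sketch of one step of Algorithm 1, assuming model is a Hugging Face causal LM, embedder is a Sentence-BERT encoder from sentence-transformers, and watermark_model is the trained network T; these names are placeholders, not the repository's actual API.

# Illustrative sketch of Algorithm 1: add the semantic watermark logits
# P_W = T(E(t)) to the base model's next-token logits, scaled by δ.
import torch

@torch.no_grad()
def watermarked_logits(model, tokenizer, embedder, watermark_model,
                       input_ids: torch.Tensor, delta: float = 2.0) -> torch.Tensor:
    # Step 2: next-token logits P_M(x_prompt, t_{:l-1}) from the base LM.
    lm_logits = model(input_ids).logits[:, -1, :]
    # Step 3: sentence embedding e_l of the text generated so far.
    text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    e_l = embedder.encode(text, convert_to_tensor=True).unsqueeze(0)
    # Step 4: watermark logits P_W from the trained watermark model T.
    w_logits = watermark_model(e_l)
    # Step 5: P_M̂ = P_M + δ · P_W, the watermarked next-token logits.
    return lm_logits + delta * w_logits
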
Open Source Code | Yes | Our code and data are available at https://github.com/THU-BPM/Robust_Watermark.
Open Datasets | Yes | Dataset and prompt: Following prior work (Kirchenbauer et al., 2023a), we utilize the C4 dataset (Raffel et al., 2020) for data generation, taking the first 30 tokens as prompts and generating the next 200 tokens. The original C4 texts serve as human-written examples, and the test objective is to distinguish generated text from human-written text. During training of the watermark model, we utilize the WikiText-103 dataset (Merity et al., 2016), which is distinct from C4, to generate embeddings.
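
A sketch of this evaluation-data setup using Hugging Face datasets/transformers: take the first 30 tokens of each C4 document as the prompt and keep the next 200 tokens as the human-written reference. The "realnewslike" config, the OPT-1.3B tokenizer, and the 500-example cap are assumptions for illustration, not the paper's stated choices.

# Build (prompt, human reference) pairs from C4 as described above.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
c4 = load_dataset("allenai/c4", "realnewslike", split="validation", streaming=True)

PROMPT_LEN, GEN_LEN = 30, 200
examples = []
for doc in c4:
    ids = tokenizer(doc["text"])["input_ids"]
    if len(ids) < PROMPT_LEN + GEN_LEN:
        continue  # need enough tokens for both the prompt and the reference
    examples.append({
        "prompt": tokenizer.decode(ids[:PROMPT_LEN]),
        "human": tokenizer.decode(ids[PROMPT_LEN:PROMPT_LEN + GEN_LEN]),
    })
    if len(examples) >= 500:
        break
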
Dataset Splits | No | The paper uses the C4 dataset for data generation and WikiText-103 for training the watermark model, but it does not explicitly provide training/validation/test splits (e.g., percentages or sample counts) for the datasets used in the experiments. It mentions setting false positive rates for evaluation, but not the specific dataset partitioning.
Hardware Specification | Yes | Hyper-parameters: ... All experiments were conducted on an NVIDIA Tesla V100 32 GB GPU.
Software Dependencies | No | The paper mentions several tools and models, including GPT-3.5, DIPPER, WordNet, BERT, Compositional-BERT, Sentence-BERT, LLaMA-7B, OPT-1.3B, OPT-2.7B, LLaMA-13B, NLLB-200-distilled-600M, and the Adam optimizer. However, it does not provide specific version numbers for these software dependencies, which are required for reproducibility.
Experiment Setup | Yes | Hyper-parameters: The watermark model is a four-layer fully connected residual network with rectified linear unit (ReLU) activations. Hyperparameters are set to k1 = 20, k2 = 1000, λ1 = 10, and λ2 = 0.1, and the Adam optimizer (lr = 1e-5) is used for training.
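
A minimal PyTorch sketch of a watermark model matching this description: a four-layer fully connected residual network with ReLU activations mapping a sentence embedding to per-token watermark logits, trained with Adam at lr = 1e-5. The embedding dimension, vocabulary size, and block structure are illustrative assumptions; the paper's exact layer widths are not restated here.

# Sketch of the watermark model: 4 fully connected residual blocks
# with ReLU, followed by a projection to vocabulary-sized logits.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.fc(x))  # residual connection

class WatermarkModel(nn.Module):
    def __init__(self, embed_dim: int = 768, vocab_size: int = 50272):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(embed_dim) for _ in range(4)])
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, embed_dim) sentence embedding -> (batch, vocab_size) logits
        return self.out(self.blocks(e))

model = WatermarkModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr as stated above
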