Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
De-mark: Watermark Removal in Large Language Models
Authors: Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of DE-MARK in watermark removal and exploitation tasks. Through extensive experiments, we demonstrate the efficacy of DE-MARK in watermark removal and exploitation tasks on well-known language models, such as Llama3 and Mistral. Additionally, a case study on ChatGPT further confirms DE-MARK's capability to effectively remove watermarks at industry scale. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Maryland, College Park. Correspondence to: Ruibo Chen <EMAIL>, Yihan Wu <EMAIL>, Junfeng Guo <EMAIL>, Heng Huang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Calculate relative probability ratio Algorithm 2 Calculate token score Algorithm 3 Identify the prefix n-gram length h Algorithm 4 Identify watermark strength δ Algorithm 5 Identify watermark green list |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Following the previous work (Jovanović et al., 2024), we use three datasets for our experiments, Dolly creative writing (Conover et al., 2023), MMW Book Report (Piet et al., 2023), and MMW Story (Piet et al., 2023), each containing about 100 prompts for open-ended writing. We also include additional experiments on WaterBench (Tu et al., 2023), selecting two tasks (Entity Probing and Concept Probing) with short outputs, and one task (Long-form QA) with long outputs. |
| Dataset Splits | No | The paper mentions datasets for open-end writing with a certain number of prompts (e.g., 'about 100 prompts') and specifies generating '300 tokens for each prompt'. However, it does not explicitly provide information on how these datasets were split into training, validation, or test sets in a way that would allow for reproduction of data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | Yes | We use gpt-3.5-turbo-0125 to evaluate the results. |
| Experiment Setup | Yes | For the watermark config, we choose prefix n-gram length h = 3, watermark strength δ = 2, and watermark token ratio (Kirchenbauer et al., 2023a) of 0.5. We use the z-score (Kirchenbauer et al., 2023a) for watermark detection, and report the median p-value (false positive rate) in our experiments. For DE-MARK hyperparameters, we use α1 = 0.2, α2 = 10, β = 0.8, γ = 0.1. The target token size m in Alg. 3 and Alg. 4 is set to 50. The repeat time c in Alg. 4 is set to 5. In the gray-box setting (L1), we use top-20 log-probabilities. We generate 300 tokens for each prompt, and suppress the EOS token following (Kirchenbauer et al., 2023a). |
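The detection statistic cited in the setup above is the standard green-list z-test from Kirchenbauer et al. (2023a): count how many of the T generated tokens fall in the watermark green list and test whether that fraction significantly exceeds the green-list ratio γ (0.5 in this configuration). A minimal sketch, with function names that are ours rather than from the paper:

```python
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """One-proportion z-test: does the observed green-token fraction
    exceed the expected ratio gamma under the no-watermark null?"""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

def z_to_p_value(z: float) -> float:
    """One-sided p-value (the reported false positive rate) for a z score,
    via the complementary error function of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For example, with 300 generated tokens (as in the setup) of which 200 are green, the z-score is about 5.77, giving a p-value far below any common detection threshold; a successful watermark removal would drive the green fraction back toward γ and the p-value toward 0.5.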