Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
De-mark: Watermark Removal in Large Language Models
Authors: Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of DE-MARK in watermark removal and exploitation tasks. Through extensive experiments, we demonstrate the efficacy of DE-MARK in watermark removal and exploitation tasks on well-known language models, such as Llama3 and Mistral. Additionally, a case study on ChatGPT further confirms DE-MARK's capability to effectively remove watermarks at industry scale. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Maryland, College Park. Correspondence to: Ruibo Chen <EMAIL>, Yihan Wu <EMAIL>, Junfeng Guo <EMAIL>, Heng Huang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Calculate relative probability ratio Algorithm 2 Calculate token score Algorithm 3 Identify the prefix n-gram length h Algorithm 4 Identify watermark strength δ Algorithm 5 Identify watermark green list |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Following the previous work (Jovanović et al., 2024), we use three datasets for our experiments, Dolly creative writing (Conover et al., 2023), MMW Book Report (Piet et al., 2023), and MMW Story (Piet et al., 2023), each containing about 100 prompts for open-ended writing. We also include additional experiments on WaterBench (Tu et al., 2023), selecting two tasks (Entity Probing and Concept Probing) with short outputs, and one task (Long-form QA) with long outputs. |
| Dataset Splits | No | The paper mentions datasets for open-end writing with a certain number of prompts (e.g., 'about 100 prompts') and specifies generating '300 tokens for each prompt'. However, it does not explicitly provide information on how these datasets were split into training, validation, or test sets in a way that would allow for reproduction of data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | Yes | We use gpt-3.5-turbo-0125 to evaluate the results. |
| Experiment Setup | Yes | For the watermark config, we choose prefix n-gram length h = 3, watermark strength δ = 2, and watermark token ratio (Kirchenbauer et al., 2023a) of 0.5. We use the z-score (Kirchenbauer et al., 2023a) for watermark detection, and report the median p-value (false positive rate) in our experiments. For DE-MARK hyperparameters, we use α1 = 0.2, α2 = 10, β = 0.8, γ = 0.1. The target token size m in Alg. 3 and Alg. 4 is set to 50. The repeat time c in Alg. 4 is set to 5. In the gray-box setting (L1), we use top-20 log-probabilities. We generate 300 tokens for each prompt, and suppress the EOS token following (Kirchenbauer et al., 2023a). |
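The detection statistic cited in the setup above is the standard green-list z-test from Kirchenbauer et al. (2023a): count how many of the T generated tokens fall in the watermark green list and test whether that fraction significantly exceeds the green-list ratio γ (0.5 in this configuration). A minimal sketch, with function names that are ours rather than from the paper:

```python
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """One-proportion z-test: does the observed green-token fraction
    exceed the expected ratio gamma under the no-watermark null?"""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

def z_to_p_value(z: float) -> float:
    """One-sided p-value (the reported false positive rate) for a z score,
    via the complementary error function of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For example, with 300 generated tokens (as in the setup) of which 200 are green, the z-score is about 5.77, giving a p-value far below any common detection threshold; a successful watermark removal would drive the green fraction back toward γ and the p-value toward 0.5.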