Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Ward: Provable RAG Dataset Inference via LLM Watermarks
Authors: Nikola Jovanović, Robin Staab, Maximilian Baader, Martin Vechev
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation in a wide range of settings affirms the fundamental limitations of existing datasets and all RAG-DI baselines, and demonstrates the effectiveness of WARD, which consistently shows high accuracy, query efficiency, and robustness (Section 5). |
| Researcher Affiliation | Academia | Nikola Jovanović, Robin Staab, Maximilian Baader, Martin Vechev (ETH Zurich) |
| Pseudocode | No | The paper describes methods in prose, without explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code and the FARAD dataset are publicly available at https://github.com/eth-sri/ward. |
| Open Datasets | Yes | Our source code and the FARAD dataset are publicly available at https://github.com/eth-sri/ward. The FARAD dataset consists of a number of groups. Each group contains articles that share a topic and a significant amount of information, but are independently written by a different (LLM) author. As our data source we use RepLiQA (Monteiro et al., 2024), which contains articles about fictional entities and events; by design, this ensures that this knowledge was not present in any LLM training data. |
| Dataset Splits | Yes | For the Easy setting, we sample four subsets of distinct groups, where the sizes of the subsets are respectively (200, 300, 300, 200). Then, for each subset i ∈ {1, 2, 3}, we only take articles from Ai and include all of them in the RAG corpus D. Out of those, the articles from subset 1 by author A1 are taken as Ddo in the IN case, i.e., these are potentially modified by the service provider before inserting them into the RAG corpus. Similarly, the articles from subset 4 by author A4 are reserved as Ddo in the OUT case. This setup ensures no fact redundancy in the RAG corpus. In contrast, to create the Hard setting with fact redundancy, we start by sampling 1000 distinct groups. The RAG corpus D is then built by including all articles from those groups that were written by A1, A2, and A3. A randomly subsampled set of 200 of those articles written by A1 is taken as Ddo in the IN case. Similarly, for the OUT case, Ddo is formed by randomly sampling 200 of the 1000 groups and taking the A4 document from each. |
| Hardware Specification | No | The paper mentions several LLM models (GPT3.5, CLAUDE3-HAIKU, LLAMA3.1-70B, LLAMA3.1-8B) used for experiments, but does not specify the underlying hardware (GPU/CPU models, etc.) on which these models were run. |
| Software Dependencies | No | The paper mentions specific LLM models (LLAMA3.1-8B, GPT3.5, CLAUDE3-HAIKU, LLAMA3.1-70B) and embedding models (ALL-MINILM-L6-V2), but does not provide specific version numbers for general software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Our experimental setup follows Section 3: we use FARAD to define two evaluation settings, and in both evaluate IN and OUT cases, i.e., where the data owner's data is (resp. is not) contained in D. We use |Ddo| = 200, and |D| = 800 for FARAD-Easy, and |D| = 3000 for FARAD-Hard (sampling detailed in App. B.1). We use several LLMs as M: GPT3.5, CLAUDE3-HAIKU, and LLAMA3.1-70B, and vary the system prompt: we use a short naive prompt (Naive-P) with basic RAG instructions, and a longer defense (Def-P) prompt... Each experiment is run with 5 random seeds. If not specified otherwise (see Section 5.4), the RAG uses k = 3 shots. For WARD, we use Position PRF (Kirchenbauer et al., 2024), h = 2, and δ = 3.5, ablating these in Section 5.4. |