Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
Authors: Michael Aerni, Javier Rando, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Concretely, we collect outputs from state-of-the-art conversational LLMs prompted with a variety of common and benign tasks (including real conversations from Wild Chat (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)). We then measure the fraction of generated text that overlaps (to varying degrees) with snippets of text from the public Web, and compare this with human-written baselines for the same tasks. |
| Researcher Affiliation | Collaboration | 1 ETH Zurich, 2 Google DeepMind, 3 Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and procedures in narrative text and mathematical formulas (e.g., in Appendix A.2) but does not include any clearly labeled pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | Code and data: https://github.com/ethz-spylab/non-adversarial-reproduction. ... We release all our code for inference and analysis. |
| Open Datasets | Yes | Concretely, we collect outputs from state-of-the-art conversational LLMs prompted with a variety of common and benign tasks (including real conversations from WildChat (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)). ... 1. We manually define different tasks and generate corresponding prompts, e.g., "Write a travel blog post about Rwanda." 2. We collect prompts from real-world sources, e.g., the PERSUADE 2.0 (Crossley et al., 2023) essay corpus or the r/WritingPrompts and r/explainlikeimfive subreddits. ... We release all data that is free from copyright concerns via https://github.com/ethz-spylab/non-adversarial-reproduction. |
| Dataset Splits | No | The paper collects prompts and compares existing LLM generations with human-written texts. It does not train a model itself, so training/validation/test splits are not applicable to its methodology. The data used for analysis are described as collected sets of prompts and existing conversations. |
| Hardware Specification | No | For Llama models, we use the API of https://deepinfra.com/ and otherwise the API of each model's creator. The paper relies on black-box inference APIs and does not specify the hardware used for these inferences or for its own analysis. |
| Software Dependencies | No | The paper states, 'We release all our code for inference and analysis,' but does not provide specific software dependencies (libraries, frameworks, or languages) with their version numbers that are required to replicate its methodology. |
| Experiment Setup | Yes | For all models, we sample with temperature 0.7 as is typical in practice... For every prompt, we run LLM inference with temperatures 0.7 and 0; we mainly report results at temperature 0.7. If not mentioned otherwise, we use 5 different seeds at temperature 0.7 to reduce variance. |
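The core measurement described above is the fraction of generated text that overlaps with snippets from the public Web. As a rough illustration of that kind of metric, the sketch below computes the fraction of characters in a generation covered by some length-`k` substring that also occurs in a reference corpus. This is a minimal, hypothetical reimplementation for intuition only: the function name, the naive substring search, and the tiny in-memory corpus are assumptions, and the paper's actual pipeline (matching against a large Web index) is described in the paper and its released code.

```python
def overlap_fraction(generation: str, corpus_docs: list[str], k: int = 50) -> float:
    """Fraction of characters in `generation` that lie inside at least one
    length-k substring also found verbatim in some reference document.

    Naive O(n * corpus) sketch; a real pipeline would use an indexed
    structure (e.g., a suffix array) over the corpus instead.
    """
    n = len(generation)
    if n < k:
        return 0.0
    covered = [False] * n
    for i in range(n - k + 1):
        window = generation[i:i + k]
        # Mark every character of a matching window as "reproduced".
        if any(window in doc for doc in corpus_docs):
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / n
```

For example, with `k=5`, a 10-character generation whose first 5 characters appear verbatim in the corpus yields an overlap fraction of 0.5. Averaging this quantity over many generations (e.g., across the 5 seeds at temperature 0.7 mentioned in the setup) would give a per-model overlap estimate.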