Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

Authors: Michael Aerni, Javier Rando, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Concretely, we collect outputs from state-of-the-art conversational LLMs prompted with a variety of common and benign tasks (including real conversations from WildChat (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)). We then measure the fraction of generated text that overlaps (to varying degrees) with snippets of text from the public Web, and compare this with human-written baselines for the same tasks."
Researcher Affiliation | Collaboration | 1 ETH Zurich, 2 Google DeepMind, 3 Carnegie Mellon University
Pseudocode | No | The paper describes methods and procedures in narrative text and mathematical formulas (e.g., in Appendix A.2) but does not include any clearly labeled pseudocode or algorithm blocks for its own methodology.
Open Source Code | Yes | "Code and data: https://github.com/ethz-spylab/non-adversarial-reproduction." ... "We release all our code for inference and analysis."
Open Datasets | Yes | "Concretely, we collect outputs from state-of-the-art conversational LLMs prompted with a variety of common and benign tasks (including real conversations from WildChat (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023))." ... "1. We manually define different tasks and generate corresponding prompts, e.g., 'Write a travel blog post about Rwanda.' 2. We collect prompts from real-world sources, e.g., the PERSUADE 2.0 (Crossley et al., 2023) essay corpus or the r/WritingPrompts and r/explainlikeimfive subreddits." ... "We release all data that is free from copyright concerns via https://github.com/ethz-spylab/non-adversarial-reproduction."
Dataset Splits | No | The paper collects prompts, analyzes existing LLM generations and human-written texts, and compares them. It does not train a model itself, so training/test/validation splits are not applicable to its methodology. The data used for analysis are described as collected sets of prompts and existing conversations.
Hardware Specification | No | "For Llama models, we use the API of https://deepinfra.com/ and otherwise the API of each model's creator." The paper relies on black-box inference APIs and does not specify the hardware used for these inferences or for the authors' own analysis.
Software Dependencies | No | The paper states, "We release all our code for inference and analysis," but does not list specific software dependencies (libraries, frameworks, or languages) with the version numbers required to replicate its methodology.
Experiment Setup | Yes | "For all models, we sample with temperature 0.7 as is typical in practice... For every prompt, we run LLM inference with temperatures 0.7 and 0; we mainly report results at temperature 0.7. If not mentioned otherwise, we use 5 different seeds at temperature 0.7 to reduce variance."
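The paper's central quantity — the fraction of generated text that overlaps with snippets from the public Web — can be approximated as character coverage by long verbatim substring matches. The following is a minimal sketch, not the authors' implementation: the function name and the match-length threshold `k` are ours, and the real reference corpus is an index of the public Web rather than a single in-memory string.

```python
def reproduced_fraction(text: str, corpus: str, k: int = 50) -> float:
    """Fraction of characters in `text` covered by substrings of
    length >= k that appear verbatim in `corpus` (illustrative only)."""
    n = len(text)
    covered = [False] * n
    for start in range(n - k + 1):
        if text[start:start + k] in corpus:
            # Greedily extend the verbatim match to the right.
            end = start + k
            while end < n and text[start:end + 1] in corpus:
                end += 1
            for i in range(start, end):
                covered[i] = True
    return sum(covered) / n if n else 0.0
```

At Web scale one would replace the naive `in corpus` membership test with a suffix-array or n-gram index; the coverage logic stays the same.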
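The quoted sampling setup (five seeded runs at temperature 0.7 plus one deterministic run at temperature 0 per prompt) can be enumerated as a flat job list before dispatching to an inference API. This is an illustrative sketch, not the released code; the dictionary keys are hypothetical.

```python
def inference_jobs(prompts, seeds=(0, 1, 2, 3, 4)):
    """Build one inference job per (prompt, temperature, seed) combination:
    5 seeded samples at temperature 0.7 plus a single run at temperature 0."""
    jobs = []
    for prompt in prompts:
        for seed in seeds:
            jobs.append({"prompt": prompt, "temperature": 0.7, "seed": seed})
        jobs.append({"prompt": prompt, "temperature": 0.0, "seed": 0})
    return jobs
```

Averaging the overlap metric over the five seeded generations is what reduces the variance the paper mentions.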