Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
Authors: Michael Aerni, Javier Rando, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Concretely, we collect outputs from state-of-the-art conversational LLMs prompted with a variety of common and benign tasks (including real conversations from Wild Chat (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)). We then measure the fraction of generated text that overlaps (to varying degrees) with snippets of text from the public Web, and compare this with human-written baselines for the same tasks. |
| Researcher Affiliation | Collaboration | 1 ETH Zurich, 2 Google DeepMind, 3 Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and procedures in narrative text and mathematical formulas (e.g., in Appendix A.2) but does not include any clearly labeled pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | Code and data: https://github.com/ethz-spylab/non-adversarial-reproduction. ... We release all our code for inference and analysis. |
| Open Datasets | Yes | Concretely, we collect outputs from state-of-the-art conversational LLMs prompted with a variety of common and benign tasks (including real conversations from WildChat (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)). ... 1. We manually define different tasks and generate corresponding prompts, e.g., "Write a travel blog post about Rwanda." 2. We collect prompts from real-world sources, e.g., the PERSUADE 2.0 (Crossley et al., 2023) essay corpus or the r/WritingPrompts and r/explainlikeimfive subreddits. ... We release all data that is free from copyright concerns via https://github.com/ethz-spylab/non-adversarial-reproduction. |
| Dataset Splits | No | The paper collects prompts and compares existing LLM generations with human-written texts. It does not train a model itself, so training/validation/test splits are not applicable to its methodology. The data used for analysis are described as collected sets of prompts and existing conversations. |
| Hardware Specification | No | For Llama models, we use the API of https://deepinfra.com/ and otherwise the API of each model's creator. The paper relies on black-box inference APIs and does not specify the hardware used for these inferences or for its own analysis. |
| Software Dependencies | No | The paper states, 'We release all our code for inference and analysis,' but does not provide specific software dependencies (libraries, frameworks, or languages) with their version numbers that are required to replicate its methodology. |
| Experiment Setup | Yes | For all models, we sample with temperature 0.7 as is typical in practice... For every prompt, we run LLM inference with temperatures 0.7 and 0; we mainly report results at temperature 0.7. If not mentioned otherwise, we use 5 different seeds at temperature 0.7 to reduce variance. |
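The core measurement described above is the fraction of generated text that overlaps with snippets from the public Web. As a rough illustration of that kind of metric, the sketch below computes the fraction of characters in a generation covered by some length-`k` substring that also occurs in a reference corpus. This is a minimal, hypothetical reimplementation for intuition only: the function name, the naive substring search, and the tiny in-memory corpus are assumptions, and the paper's actual pipeline (matching against a large Web index) is described in the paper and its released code.

```python
def overlap_fraction(generation: str, corpus_docs: list[str], k: int = 50) -> float:
    """Fraction of characters in `generation` that lie inside at least one
    length-k substring also found verbatim in some reference document.

    Naive O(n * corpus) sketch; a real pipeline would use an indexed
    structure (e.g., a suffix array) over the corpus instead.
    """
    n = len(generation)
    if n < k:
        return 0.0
    covered = [False] * n
    for i in range(n - k + 1):
        window = generation[i:i + k]
        # Mark every character of a matching window as "reproduced".
        if any(window in doc for doc in corpus_docs):
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / n
```

For example, with `k=5`, a 10-character generation whose first 5 characters appear verbatim in the corpus yields an overlap fraction of 0.5. Averaging this quantity over many generations (e.g., across the 5 seeds at temperature 0.7 mentioned in the setup) would give a per-model overlap estimate.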