Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

Authors: Ken Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our work, we first find that even after removing a set of extracted sequences from the training dataset and retraining the LLM from scratch, the retrained model can still verbatim complete 40% of them under our experimental conditions (Section 4). Upon investigation, we find that these removed yet still completed sequences are either de facto members of the training set (but for a different definition of membership) or lacking sufficient complexity: many examples have near duplicates, sequences with m < n-grams that are not removed, or are explained by the model's generalization capabilities (e.g., patterns or counting).
Researcher Affiliation | Collaboration | ¹Google. ²Work completed while on internship at Google DeepMind. Now at Stanford University. ³Stanford University. Correspondence to: Ken Liu <EMAIL>, Christopher A. Choquette-Choo <EMAIL>.
Pseudocode | Yes | Algorithm 1: Fine-tuning sequences from Chunking (§5.1). 1: Input: A sequence x of length n tokens, chunk size c, overlap l, random seed s. 2: Output: A sequence x′ with exactly one chunk from x at a random position and the rest filled with random tokens. 3: Set random seed to s. 4: positions ← [0, (c − l), 2(c − l), ..., (n − l)] (possible positions for the start of the chunk). 5: p ← randomly choose from positions. 6: x′ ← sequence of length n tokens, initialized with placeholders. 7: x′[p : p + c] ← x[p : p + c] (copy a chunk from x, and truncate if needed). 8: for each placeholder in x′ do. 9: replace it with a random token from the tokenizer's vocabulary. 10: end for. 11: return x′.
Open Source Code | No | The paper discusses the methodology but does not explicitly state that the source code for their specific methods (e.g., n-gram filtering, adversarial dataset construction techniques) is publicly available or provided. It mentions using "LLM.c (Karpathy, 2024) for an efficient pre-training pipeline", which is a third-party tool, but not their own code for the paper's novel contributions.
Open Datasets | Yes | Data. For all models, we use FineWeb-Edu (Penedo et al., 2024) as a state-of-the-art pre-training dataset.
Dataset Splits | Yes | 2. Identify verbatim completions: We then collect a set of sequences D_mem of length k that M_base can complete verbatim (as in Def. 3.2), by checking the first k tokens of every training document in D_base. This is a simple and effective procedure since LLMs are known to memorize training data (e.g., Carlini et al. (2022b)); other choices to obtain D_mem are also possible. 3. n-gram filtering: We then filter each sequence x ∈ D_mem away from D_base. ... The filtered dataset is denoted as D^(n)_filter.
Hardware Specification | Yes | Compute: 8 NVIDIA H100 days (1.6B-parameter model)
Software Dependencies | Yes | We use LLM.c (Karpathy, 2024) for an efficient pre-training pipeline.
Experiment Setup | Yes | Table 9: Training configurations for pre-training experiments. # Training Tokens: 33.6 billion; Micro-Batch Size: 16; Max Sequence Length: 1024; Total Batch Size: 2^20 = 1,048,576 tokens; Gradient Accumulation Steps: 8; Weight Decay: 0.1; Learning Rate: 6e-4; LR Schedule: Cosine; LR Decay: decay to 10% of max LR; Warmup Iterations: 700 iterations; Total Training Steps: 32,000
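The chunking procedure quoted as Algorithm 1 in the Pseudocode row can be sketched in Python as follows. This is a minimal illustration and not the authors' code, which the summary above notes is not released. The function name `chunk_sequence`, the `vocab_size` parameter, and the `PLACEHOLDER` sentinel are hypothetical. The sketch assumes that token IDs are integers in `[0, vocab_size)`, that chunk start positions are the multiples of `c - l` up to `n - l`, and that a per-call `random.Random(seed)` is an acceptable stand-in for "set random seed to s".

```python
import random

# Hypothetical sentinel marking slots that have not been filled yet.
PLACEHOLDER = None


def chunk_sequence(x, chunk_size, overlap, vocab_size, seed):
    """Keep one randomly placed chunk of x; fill every other slot with random tokens.

    Assumption: token IDs are integers in [0, vocab_size).
    """
    n = len(x)
    if n == 0:
        raise ValueError("x must contain at least one token")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Step 3: seed a local generator so the result is reproducible.
    rng = random.Random(seed)

    # Step 4: candidate start positions 0, (c - l), 2(c - l), ...
    # Assumption: starts are capped at n - l and must lie inside the sequence.
    stride = chunk_size - overlap
    last_start = max(0, min(n - overlap, n - 1))
    positions = list(range(0, last_start + 1, stride))

    # Step 5: pick one start position at random.
    p = rng.choice(positions)

    # Steps 6-7: copy the chunk; slicing truncates it at the end of x.
    out = [PLACEHOLDER] * n
    out[p:p + chunk_size] = x[p:p + chunk_size]

    # Steps 8-10: replace remaining placeholders with random vocabulary tokens.
    return [rng.randrange(vocab_size) if tok is PLACEHOLDER else tok for tok in out]


# Usage with made-up token IDs and an illustrative vocabulary size.
example = list(range(1000, 1032))
print(chunk_sequence(example, chunk_size=8, overlap=2, vocab_size=50257, seed=0))
```

Using a local generator rather than the global `random.seed` keeps calls independent of each other, so the same `(x, chunk_size, overlap, seed)` always yields the same output regardless of what else has consumed random numbers.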