Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

Authors: Ken Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In our work, we first find that even after removing a set of extracted sequences from the training dataset and retraining the LLM from scratch, the retrained model can still verbatim complete 40% of them under our experimental conditions (Section 4). Upon investigation, we find that these removed yet still completed sequences are either de facto members of the training set (but for a different definition of membership) or lacking sufficient complexity: many examples have near duplicates, sequences with m < n-grams that are not removed, or are explained by the model's generalization capabilities (e.g., patterns or counting).
Researcher Affiliation	Collaboration	¹Google. ²Work completed while on internship at Google DeepMind. Now at Stanford University. ³Stanford University. Correspondence to: Ken Liu <EMAIL>, Christopher A. Choquette-Choo <EMAIL>.
Pseudocode	Yes	Algorithm 1 Fine-tuning sequences from Chunking (§5.1). 1: Input: A sequence x of length n tokens, chunk size c, overlap l, random seed s. 2: Output: A sequence x′ with exactly one chunk from x at a random position and the rest filled with random tokens. 3: Set random seed to s. 4: positions ← [0, (c − l), 2(c − l), ..., (n − l)] (possible positions for the start of the chunk). 5: p ← randomly choose from positions. 6: x′ ← sequence of length n tokens, initialized with placeholders. 7: x′[p : p + c] ← x[p : p + c] (copy a chunk from x, and truncate if needed). 8: for each placeholder in x′ do 9: replace it with a random token from the tokenizer's vocabulary 10: end for 11: return x′
Open Source Code	No	The paper discusses the methodology but does not explicitly state that the source code for its specific methods (e.g., n-gram filtering, adversarial dataset construction techniques) is publicly available or provided. It mentions using "LLM.c (Karpathy, 2024) for an efficient pre-training pipeline", which is a third-party tool, not the authors' own code for the paper's novel contributions.
Open Datasets	Yes	Data. For all models, we use FineWeb-Edu (Penedo et al., 2024) as a state-of-the-art pre-training dataset.
Dataset Splits	Yes	2. Identify verbatim completions: We then collect a set of sequences D_mem of length k that M_base can complete verbatim (as in Def. 3.2), by checking the first k tokens of every training document in D_base. This is a simple and effective procedure since LLMs are known to memorize training data (e.g., Carlini et al. (2022b)); other choices to obtain D_mem are also possible. 3. n-gram filtering: We then filter each sequence x ∈ D_mem away from D_base. ... The filtered dataset is denoted as D^(n)_filter.
Hardware Specification	Yes	Compute: 8 NVIDIA H100 days (1.6B parameter model)
Software Dependencies	Yes	We use LLM.c (Karpathy, 2024) for an efficient pre-training pipeline.
Experiment Setup	Yes	Table 9: Training configurations for pre-training experiments. # Training Tokens: 33.6 billion; Micro-Batch Size: 16; Max Sequence Length: 1024; Total Batch Size: 2^20 = 1,048,576 tokens; Gradient Accumulation Steps: 8; Weight Decay: 0.1; Learning Rate: 6e-4; LR Schedule: Cosine; LR Decay: decay to 10% of max LR; Warmup Iterations: 700 iterations; Total Training Steps: 32,000
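
The chunking procedure quoted in the Pseudocode row (Algorithm 1) can be sketched in Python roughly as follows. This is an illustrative reading of the extracted pseudocode, not the authors' code: the function name `chunk_with_random_fill` and the `vocab_size` parameter are hypothetical, start positions are read as multiples of (c - l) up to (n - l), and a start position equal to n is dropped so that the copied chunk is never empty (an assumption the listing does not state).

```python
import random


def chunk_with_random_fill(x, c, l, seed, vocab_size):
    """Keep one chunk of x at its original position; randomize everything else.

    x          : list of token ids (length n)
    c          : chunk size
    l          : overlap between consecutive candidate chunks
    seed       : random seed
    vocab_size : size of the tokenizer vocabulary (hypothetical parameter)
    """
    n = len(x)
    if n == 0 or not 0 <= l < c:
        raise ValueError("need a non-empty sequence and 0 <= l < c")

    rng = random.Random(seed)  # step 3: seed a local generator

    # Step 4: candidate starts 0, (c - l), 2(c - l), ..., up to (n - l).
    # Assumption: starts >= n are skipped so the chunk has at least one token.
    positions = [p for p in range(0, n - l + 1, c - l) if p < n]

    p = rng.choice(positions)  # step 5

    out = [None] * n  # step 6: placeholders
    chunk = x[p:p + c]  # step 7: slicing truncates at the end of x
    out[p:p + len(chunk)] = chunk

    # Steps 8-10: replace every placeholder with a random token id.
    return [rng.randrange(vocab_size) if t is None else t for t in out]
```

Calling `chunk_with_random_fill(tokens, c=8, l=2, seed=0, vocab_size=50257)` returns a list of the same length as `tokens`; the vocabulary size shown here is only an example value.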
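
The figures quoted in the Experiment Setup row can be cross-checked with a little arithmetic. The sketch below assumes the usual relation total batch = micro-batch × sequence length × accumulation steps × data-parallel workers; the worker count is inferred from that relation and is not stated in the row itself. All variable names are illustrative.

```python
# Values copied from the quoted Table 9.
micro_batch_size = 16
max_seq_len = 1024
grad_accum_steps = 8
total_batch_tokens = 2 ** 20  # 1,048,576 tokens
total_steps = 32_000
max_lr = 6e-4
warmup_iters = 700

# Tokens processed by one worker per optimizer step.
tokens_per_worker_step = micro_batch_size * max_seq_len * grad_accum_steps

# Assumption: the remaining factor is the number of data-parallel workers.
implied_workers = total_batch_tokens // tokens_per_worker_step

# Total tokens seen over training, to compare with "33.6 billion".
total_tokens = total_batch_tokens * total_steps

# "Decay to 10% of max LR" gives the floor of the cosine schedule.
min_lr = 0.1 * max_lr

# Share of training spent in warmup.
warmup_fraction = warmup_iters / total_steps

print(f"implied workers: {implied_workers}")
print(f"total tokens: {total_tokens:,}")
print(f"min LR: {min_lr:.1e}, warmup fraction: {warmup_fraction:.2%}")
```

Under that assumption, the step count and batch size multiply out to about 33.55 billion tokens, which agrees with the reported 33.6 billion after rounding.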