Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Blackbox Model Provenance via Palimpsestic Membership Inference

Authors: Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Chris Potts, Percy Liang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our tests in Section 4 using the Pythia ... model families, as well as small-scale models we train on TinyStories [5, 6, 7]. Finally, we conclude with a discussion of key takeaways and directions for future work in Section 5. We release code and data for reproducing experiments. 4 Experiments. Transcript (α). We use the ordered pretraining data from various open-source language models as Alice's transcript. We consider five families of models, each corresponding to a different pretraining dataset ranging from 300B to 4T tokens: (1) pythia: The Pile dataset used for training Pythia models [5]; (2) pythia-deduped: The deduped version of The Pile used for training Pythia-deduped models; (3) OLMo: Dolma v1.5 dataset used for training OLMo models [6]; (4) OLMo-1.7: Dolma v1.7 dataset used for training OLMo-0424 and OLMo-0724 models; and (5) OLMo-2: OLMo-Mix used for training the stage1 of OLMo-2-1124 models [7]. Additionally, we use TinyStories [40] to train small-scale models for ablations that would otherwise be prohibitively expensive, such as trying multiple epochs of training. We subsample sequences from each dataset to conduct our test (see Appendices A and B regarding sampling details for the query and observational settings respectively).
Researcher Affiliation | Academia | Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Christopher Potts, Percy Liang. Department of Computer Science, Stanford University.
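Pseudocode | Yes | Algorithm 1: Obtaining p-values from arbitrary test statistics
Input: Transcript $\alpha = \{(x_i, t_i)\}_{i=1}^{n}$; artifact $\beta$
Parameters: test statistic $\phi$; number of permutations $m$
Output: p-value $\hat{p} \in (0, 1]$
1. for $j \in 1, \dots, m$ do
2.     $\sigma_j \sim \mathrm{Unif}([N] \to [N])$; $\alpha_j = \{(x_i, \sigma_j(t_i))\}_{i=1}^{N}$
3.     $\phi_j \leftarrow \phi(\alpha_j, \beta)$
4. $\hat{p} \leftarrow 1 - \frac{1}{m+1}\left(1 + \sum_{j=1}^{m} \mathbf{1}\{\phi_j < \phi(\alpha, \beta)\}\right)$ // break ties randomly
5. return $\hat{p}$

Algorithm 2: Training models on partitioned transcript ($\phi_{\mathrm{obs}}^{\mathrm{part}}$)
Input: Transcript $\alpha = \{(x_i, t_i)\}_{i=1}^{n}$; text $x_\beta \in X$
Parameters: number of models $k$; metric $\chi$
Output: test statistic $\phi_{\mathrm{obs}}^{\mathrm{part}}(\Gamma, \beta)$
1. Sort examples $x$ by indices $t$
2. Split sorted $x$ into $x_1, \dots, x_k$ contiguous partitions and train models $\mu_1, \dots, \mu_k$ on partitions
3. return $\rho(\{\chi(\mu_j, x_\beta), \{j\}_{j=1}^{k}\})$

Algorithm 3: Training models on shuffled transcript ($\phi_{\mathrm{obs}}^{\mathrm{shuff}}$)
Input: Transcript $\alpha = \{(x_i, t_i)\}_{i=1}^{n}$; text $x_\beta \in X$
Parameters: number of models $k$; metric $\chi$
Output: test statistic $\phi_{\mathrm{obs}}^{\mathrm{shuff}}(\Gamma, \beta)$
1. Sort examples $x$ by indices $t$
2. Train model $\mu_0$ on $x$ (in sorted order) and models $\mu_1, \dots, \mu_k$ on independent reshuffles of $x$
3. $\mu \leftarrow \frac{1}{k} \sum_{i=1}^{k} \chi(\mu_i, x_\beta)$; $\sigma \leftarrow \sqrt{\frac{1}{k-1} \sum_{i=1}^{k} (\chi(\mu_i, x_\beta) - \mu)^2}$
4. return $(\chi(\mu_0, x_\beta) - \mu) / \sigma$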
Open Source Code | Yes | We release code and data for reproducing experiments (footnote 5: https://github.com/RohithKuditipudi/blackbox-model-tracing).
Open Datasets | Yes | We empirically validate our tests in Section 4 using the Pythia (trained on pythia and pythia-deduped, the deduped and non-deduped Pile datasets used to train Pythia models) and OLMo (trained on OLMo, OLMo-1.7, and OLMo-2, the Dolma and OLMo-Mix datasets) model families, as well as small-scale models we train on TinyStories [5, 6, 7].
Dataset Splits | Yes | Sampling text. As before, to obtain Bob's text $x_\beta$ we independently generate short texts then group these texts together. We generate these texts as continuations of prefixes from the TinyStories test set.
Hardware Specification | Yes | We run our experiments on an internal cluster using NVIDIA A100 and A6000 GPUs.
Software Dependencies | No | The paper mentions software components like "infini-gram code base [42]" but does not specify version numbers for general software dependencies like Python or PyTorch, or for the infini-gram tool itself.
Experiment Setup | Yes | For each of the five families, we use up to 1M 64-token sequences randomly sampled from the first epoch as our transcript α and evaluate the 7B-scale model checkpoint at the end of the first epoch. Test statistic (ϕ). We primarily experiment with a version of $\phi_{\mathrm{obs}}^{\mathrm{part}}$ using n-gram models (n = 8) wherein we let χ count the number of exact matches among Bob's text with the n-gram index underlying each model and let k be the total number of minibatches in Alice's training run... On TinyStories, we use the same model architecture (d_model = 256, d_ffn = 512, num_layers = 4, approximately 3M parameters) as in the multiple epoch experiments. For the observational setting experiments, we train for a single epoch on 500K documents with a constant learning rate of $1 \times 10^{-5}$ and 4 documents per batch. We save checkpoints every 10k documents starting at 450K documents, which we use to resume training on reshuffled data to obtain the models $\mu_1, \dots, \mu_k$ in our implementation of $\phi_{\mathrm{obs}}^{\mathrm{shuff}}$.
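
Illustrative sketch (not the authors' released code): the Python below shows one way the permutation test of Algorithm 1 and the z-score of Algorithm 3 in the Pseudocode row could be written. Several pieces are assumptions made only for illustration: the function names are hypothetical; Spearman rank correlation stands in for the test statistic; the p-value uses the standard add-one permutation form, which stays in (0, 1], and counts ties conservatively instead of breaking them randomly; and the training indices are shuffled among the sampled examples as a simplification of drawing σ_j.

```python
import math
import random
import statistics
from typing import Callable, List, Sequence, Tuple

# A transcript is a list of (example x_i, training index t_i) pairs.
Transcript = List[Tuple[str, int]]


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(
    transcript: Transcript,
    statistic: Callable[[Transcript], float],
    m: int = 999,
    seed: int = 0,
) -> float:
    """Add-one permutation p-value for `statistic` (assumed form, see lead-in)."""
    rng = random.Random(seed)
    observed = statistic(transcript)
    examples = [x for x, _ in transcript]
    times = [t for _, t in transcript]
    at_least = 0
    for _ in range(m):
        shuffled = times[:]
        rng.shuffle(shuffled)  # break the link between examples and order
        if statistic(list(zip(examples, shuffled))) >= observed:
            at_least += 1
    return (1 + at_least) / (m + 1)


def shuffle_z_score(chi_ordered: float, chi_reshuffled: Sequence[float]) -> float:
    """z-score of the in-order model's metric against reshuffled models."""
    mu = statistics.fmean(chi_reshuffled)
    sigma = statistics.stdev(chi_reshuffled)  # sample std, 1/(k-1) normalisation
    return (chi_ordered - mu) / sigma


if __name__ == "__main__":
    # Hypothetical toy data: the score of each example grows with its
    # training index, mimicking a model that favours later-seen examples.
    toy_transcript = [(f"x{i}", i) for i in range(20)]
    toy_scores = {f"x{i}": float(i) for i in range(20)}

    def toy_statistic(tr: Transcript) -> float:
        return spearman([toy_scores[x] for x, _ in tr], [t for _, t in tr])

    print(permutation_p_value(toy_transcript, toy_statistic, m=199))
    print(shuffle_z_score(5.0, [1.0, 2.0, 3.0]))
```

The statistic is passed in as a callable so that the artifact (Bob's model or text) can be captured in a closure, mirroring how Algorithm 1 accepts an arbitrary test statistic.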