Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Model Provenance Testing for Large Language Models

Authors: Ivica Nikolic, Teodora Baluta, Prateek Saxena

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
Researcher Affiliation	Academia	Ivica Nikolić, National University of Singapore, Singapore, EMAIL; Teodora Baluta, Georgia Institute of Technology, Georgia, USA, EMAIL; Prateek Saxena, National University of Singapore, Singapore, EMAIL
Pseudocode	Yes	Algorithm 1 Provenance Tester for g Given a Candidate Parent Set
Open Source Code	Yes	The implementation of the tester along with the two benchmarks can be found at https://github.com/ivicanikolicsg/model_provenance_testing.
Open Datasets	Yes	We collect model candidates for all provenance pairs from the Hugging Face (HF) platform [21].
Dataset Splits	Yes	From BENCH-A, we take all 100 true pairs (P_i, C_i) and create 100 false pairs (P'_i, C_i) by selecting one random non-parent P'_i ≠ P_i for each child C_i. This ensures a balanced dataset where random guessing would achieve 50% accuracy. We similarly obtain 766 testing pairs from BENCH-B5.
Hardware Specification	Yes	We run our model provenance testers on a Linux machine with 64-bit Ubuntu 22.04.3 LTS, 128GB RAM and 2x 24 CPU AMD EPYC 7443P @1.50GHz and 4x NVIDIA A40 GPUs with 48GB RAM.
Software Dependencies	No	The paper mentions "All experiments are implemented using PyTorch framework [35] and the Hugging Face Transformers library [46]" but does not specify version numbers for these software components.
Experiment Setup	Yes	We use the standard significance α = 0.05 (see Appendix G for other values). Sampling of prompts is given in Appendix D. Table 2 compares the performance of the base tester and the BAI-enhanced version across different query budgets T ∈ {500, 1000, 2000}.
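
The Dataset Splits row describes pairing each true (parent, child) pair with one randomly chosen non-parent, and the Research Type row reports precision and recall. The Python sketch below illustrates that construction and those two metrics. It is not the authors' implementation: the function names, the `candidates` list, and the fixed seed are hypothetical, and it assumes each child has exactly one true parent among the candidates.

```python
import random


def build_balanced_pairs(true_pairs, candidates, seed=0):
    """Return (parent, child, label) triples: every true pair plus one false pair per child.

    Assumption (not from the paper): a "non-parent" is any candidate model
    other than the child's listed parent and the child itself.
    """
    rng = random.Random(seed)  # fixed seed so the sampled false pairs are repeatable
    labeled = []
    for parent, child in true_pairs:
        labeled.append((parent, child, True))
        non_parents = [m for m in candidates if m != parent and m != child]
        labeled.append((rng.choice(non_parents), child, False))
    return labeled


def precision_recall(labels, predictions):
    """Compute precision and recall for boolean labels and boolean predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Because every child contributes exactly one true and one false pair, half of the resulting labels are positive, which is why random guessing on such a set yields 50% accuracy.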