Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Model Provenance Testing for Large Language Models
Authors: Ivica Nikolic, Teodora Baluta, Prateek Saxena
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90–95% precision and 80–90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available. |
| Researcher Affiliation | Academia | Ivica Nikolić, National University of Singapore, Singapore, EMAIL; Teodora Baluta, Georgia Institute of Technology, Georgia, USA, EMAIL; Prateek Saxena, National University of Singapore, Singapore, EMAIL |
| Pseudocode | Yes | Algorithm 1 Provenance Tester for g Given a Candidate Parent Set |
| Open Source Code | Yes | The implementation of the tester along with the two benchmarks can be found at https://github.com/ivicanikolicsg/model_provenance_testing. |
| Open Datasets | Yes | We collect model candidates for all provenance pairs from the Hugging Face (HF) platform [21]. |
| Dataset Splits | Yes | From BENCH-A, we take all 100 true pairs (Pᵢ, Cᵢ) and create 100 false pairs (P′ᵢ, Cᵢ) by selecting one random non-parent P′ᵢ ≠ Pᵢ for each child Cᵢ. This ensures a balanced dataset where random guessing would achieve 50% accuracy. We similarly obtain 766 testing pairs from BENCH-B. |
| Hardware Specification | Yes | We run our model provenance testers on a Linux machine with 64-bit Ubuntu 22.04.3 LTS, 128GB RAM and 2x 24 CPU AMD EPYC 7443P @1.50GHz and 4x NVIDIA A40 GPUs with 48GB RAM. |
| Software Dependencies | No | The paper mentions "All experiments are implemented using PyTorch framework [35] and the Hugging Face Transformers library [46]" but does not specify version numbers for these software components. |
| Experiment Setup | Yes | We use the standard significance α = 0.05 (see Appendix G for other values). Sampling of prompts is given in Appendix D. Table 2 compares the performance of the base tester and the BAI-enhanced version across different query budgets T ∈ {500, 1000, 2000}. |
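
The balanced pair construction quoted in the Dataset Splits row can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `build_balanced_pairs`, the model names, and the rule for what counts as a non-parent (any candidate other than the true parent and the child itself) are assumptions made for illustration only.

```python
import random


def build_balanced_pairs(true_pairs, candidate_parents, seed=0):
    """Return (parent, child, label) triples: each true pair plus one false pair.

    Hypothetical helper, not taken from the paper's repository.
    """
    rng = random.Random(seed)  # fixed seed so the sampled false pairs are repeatable
    labeled = []
    for parent, child in true_pairs:
        labeled.append((parent, child, True))
        # Assumption: a non-parent is any candidate other than the true parent
        # and the child itself.
        non_parents = [p for p in candidate_parents if p != parent and p != child]
        labeled.append((rng.choice(non_parents), child, False))
    return labeled


# Hypothetical model names, for illustration only.
true_pairs = [("base-a", "child-a1"), ("base-b", "child-b1"), ("base-c", "child-c1")]
candidates = ["base-a", "base-b", "base-c"]
pairs = build_balanced_pairs(true_pairs, candidates)
```

Because every child contributes exactly one true pair and one false pair, half of the labels are positive, which is why random guessing on such a set would score about 50% accuracy.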