Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Tools for Verifying Neural Models' Training Data

Authors: Dami Choi, Yonadav Shavit, David K. Duvenaud

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that our verification procedures can catch a wide variety of attacks, including all known attacks from the Proof-of-Learning literature. ... We demonstrate the practical effectiveness of our defenses via experiments on two language models (Section 6).
Researcher Affiliation | Academia | U. Toronto & Vector Institute EMAIL Yonadav Shavit Harvard University EMAIL David Duvenaud U. Toronto & Vector Institute EMAIL
Pseudocode | No | The paper describes a combined verification protocol with numbered steps in Appendix A, but it does not present this or any other procedure in a formal pseudocode or algorithm block.
Open Source Code | No | The paper does not provide any statements about releasing its source code or links to a code repository for the methodology described.
Open Datasets | Yes | Our main experiments are run on GPT-2 [RWC+19] with 124M parameters and trained on the Open Web Text dataset [GCPT19]. ... The data addition attack experiments in Section 6 further use the Github component of the Pile dataset [GBB+20]... In addition to training our own models, we also evaluate Pythia checkpoints [BSA+23] published by Eleuther AI... trained on the Pile dataset.
Dataset Splits | No | The paper mentions using a "validation set Dv" and discusses how a Prover can construct a "validation subset Dv by holding out the last nv data-points". However, it does not specify the concrete size or proportion of the validation split used in the authors' own experiments.
Hardware Specification | Yes | All experiments were done using 4 NVIDIA A40 GPUs.
Software Dependencies | No | The paper mentions specific models (GPT-2, Pythia) which imply certain underlying frameworks, but it does not provide specific version numbers for any software, libraries, or dependencies used in the experiments.
Experiment Setup | Yes | We use a batch size of 491,520 tokens and train for 18,000 steps... saving a checkpoint every 1000 steps. ... We use a cosine learning rate schedule that decays by a factor of 10x by the end of training, with a linear warmup of 2000 steps to a peak learning rate of 0.0006.
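
The schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the quoted text does not say whether warmup starts from a learning rate of zero, whether the cosine decay begins right after warmup, or whether a checkpoint is also saved at step 0. The names `learning_rate`, `checkpoint_steps`, and `total_tokens` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 6e-4              # peak learning rate (0.0006)
WARMUP_STEPS = 2_000        # linear warmup length
TOTAL_STEPS = 18_000        # total training steps
TOKENS_PER_BATCH = 491_520  # batch size in tokens
CHECKPOINT_EVERY = 1_000    # checkpoint interval in steps

# "Decays by a factor of 10x by the end of training".
FINAL_LR = PEAK_LR / 10


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to FINAL_LR.

    Assumptions (not stated in the quoted text): warmup starts at 0,
    and the cosine phase spans steps WARMUP_STEPS..TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


# Assumed checkpoint steps: 1000, 2000, ..., 18000 (step 0 excluded).
checkpoint_steps = list(range(CHECKPOINT_EVERY, TOTAL_STEPS + 1, CHECKPOINT_EVERY))

# Total tokens processed over the run: batch size times number of steps.
total_tokens = TOKENS_PER_BATCH * TOTAL_STEPS

if __name__ == "__main__":
    for step in (0, 1_000, 2_000, 10_000, 18_000):
        print(step, learning_rate(step))
    print(len(checkpoint_steps), total_tokens)
```

Under these assumptions the run yields 18 saved checkpoints and processes 491,520 × 18,000 = 8,847,360,000 tokens in total.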