Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Knowledge Distillation Detection for Open-weights Models

Authors: Qin Shi, Amber Yijia Zheng, Qifan Song, Raymond A. Yeh

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on diverse architectures for image classification and text-to-image generation show that our method improves detection accuracy over the strongest baselines by 59.6% on CIFAR-10, 71.2% on ImageNet, and 20.0% for text-to-image generation.
Researcher Affiliation	Academia	Department of Statistics, Purdue University; Department of Computer Science, Purdue University
Pseudocode	No	The paper describes a three-stage framework (input construction, score computation, and decision making) and provides a pipeline illustration in Figure 1, but it does not include a formal pseudocode block or algorithm.
Open Source Code	Yes	The code is available at https://github.com/shqii1j/distillation_detection.
Open Datasets	Yes	For image classification, we use two standard datasets: CIFAR-10 and ImageNet [14].
Dataset Splits	No	The paper uses standard datasets like CIFAR-10 and ImageNet, and mentions distilling student models on ImageNet-100, but it does not explicitly provide specific details about how these datasets were split into training, validation, or test sets (e.g., percentages or sample counts).
Hardware Specification	Yes	For training teacher and student models on CIFAR-10, we use a single A30 GPU. For ImageNet, student models are trained using two L40S GPUs. In the first stage of our knowledge distillation detection pipeline, we train a generator to produce synthesized images in classification tasks using a single L40S GPU. For score computation and prediction, we only perform inference with the student and teacher models. This stage uses a single A30 GPU for CIFAR-10 classification models, and a single L40S GPU for ImageNet classification and text-to-image generation models.
Software Dependencies	No	The paper describes the use of optimizers like SGD and Adam, and various architectural components, but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	We apply the same training strategy: stochastic gradient descent (SGD) with a learning rate of 0.01, momentum of 0.9, and weight decay of 5e-4; training length of 40 epochs with a batch size of 64. We apply the OneCycle learning rate scheduler, with a maximum learning rate of 0.1, computed over 45000/64 steps per epoch and updated at every training step.
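
The step count quoted in the Experiment Setup row (45000/64) is not an integer, so the schedule length depends on how the final partial batch is handled. The sketch below works out both cases in plain Python. The `TrainConfig` class and the helper names are hypothetical, introduced only for illustration, and reading 45000 as the number of training samples is an assumption drawn from the quoted fraction; the paper's own code may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    base_lr: float = 0.01
    momentum: float = 0.9
    weight_decay: float = 5e-4
    epochs: int = 40
    batch_size: int = 64
    max_lr: float = 0.1
    # Assumption: 45000 is the training-set size implied by "45000/64".
    num_train_samples: int = 45000


def steps_per_epoch(cfg: TrainConfig, drop_last: bool) -> int:
    """Optimizer steps per epoch; drop_last discards the final partial batch."""
    if drop_last:
        return cfg.num_train_samples // cfg.batch_size
    return math.ceil(cfg.num_train_samples / cfg.batch_size)


def total_steps(cfg: TrainConfig, drop_last: bool) -> int:
    """Scheduler updates over the whole run, one update per training step."""
    return cfg.epochs * steps_per_epoch(cfg, drop_last)


if __name__ == "__main__":
    cfg = TrainConfig()
    for drop_last in (True, False):
        print(
            f"drop_last={drop_last}: "
            f"{steps_per_epoch(cfg, drop_last)} steps/epoch, "
            f"{total_steps(cfg, drop_last)} total steps"
        )
```

Since 45000/64 = 703.125, an epoch has 703 steps if the last 8 samples are dropped and 704 otherwise, giving 28120 or 28160 scheduler updates over 40 epochs. In a PyTorch setup these values would typically be passed to `torch.optim.SGD` and `torch.optim.lr_scheduler.OneCycleLR`; whether the authors used those classes is not stated in the row above.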