Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Efficient Verified Unlearning For Distillation

Authors: Yijun Quan, Zushu Li, Giovanni Montana

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide both theoretical analysis, quantifying significant speed-ups in the unlearning process, and empirical validation on multiple datasets, demonstrating that PURGE achieves these efficiency gains while maintaining student accuracy comparable to standard baselines.
Researcher Affiliation	Academia	Yijun Quan, Zushu Li, and Giovanni Montana, Warwick Manufacturing Group, University of Warwick, CV4 7AL
Pseudocode	Yes	To enhance clarity for the reader, we present the PURGE training procedure in pseudo-code in Appendix A.1.
Open Source Code	Yes	Justification: The experiments are done on 4 public datasets, and the code is released.
Open Datasets	Yes	Evaluated on both image classification and sentiment analysis tasks using public datasets, including MNIST[8], SVHN[12], CIFAR-100[20] and SST5[28], our proposed PURGE delivers significant speed-ups over SISA while preserving model performance across various conditions.
Dataset Splits	Yes	Justification: Training details like the number of epochs are provided in the paper, or otherwise included in the publicly available code. We use standard train, test, and validation splits from public datasets. The random seeds used for data splits across constituent models are included in the code.
Hardware Specification	Yes	The image classification tasks were conducted on a machine equipped with one NVIDIA RTX 3090 GPU (24GB VRAM) while the sentiment classification task was conducted on a machine with 8 NVIDIA A100 GPUs (80GB VRAM each) to support the large number of BERT constituent models.
Software Dependencies	No	The paper does not explicitly state specific software dependencies with version numbers in the provided text.
Experiment Setup	Yes	Following the setup described in Section 3.2, we assume each slice is trained for the same number of epochs, $e_R$. This is related to the equivalent full-dataset training epochs $e$ by $e_R = \frac{2}{rc+1} e$ [1], ensuring comparable total computation during initial training. For this experiment, we set $e = 120$ and evaluate on the MNIST dataset. ... We configured the teacher network with $M = 32$ constituent models. ... For PURGE, we varied the number of student constituent models $N$ from 1 to 32. We tested two configurations for the number of slices per chunk: $r = 1$ and $r = 4$.
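
The epoch relation quoted in the Experiment Setup row can be checked with a short calculation. The sketch below reads the relation as e_R = 2e / (rc + 1), where r is the number of slices per chunk and c is assumed to be the number of chunks (the excerpt does not define c). The function names and the values tried for c are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
from fractions import Fraction


def epochs_per_slice(e: int, r: int, c: int) -> Fraction:
    """Per-slice epochs e_R = 2e / (rc + 1).

    e: equivalent full-dataset training epochs
    r: slices per chunk
    c: number of chunks (assumed meaning; not defined in the excerpt)
    """
    if r < 1 or c < 1:
        raise ValueError("r and c must be positive integers")
    return Fraction(2 * e, r * c + 1)


def total_cost_in_full_epochs(e: int, r: int, c: int) -> Fraction:
    """Total incremental-training cost, in units of full-dataset epochs.

    Assumes stage i trains on the first i of the rc slices for e_R epochs,
    so stage i costs e_R * i / (rc) full-dataset epochs.
    """
    n_slices = r * c
    e_r = epochs_per_slice(e, r, c)
    return sum(e_r * Fraction(i, n_slices) for i in range(1, n_slices + 1))


if __name__ == "__main__":
    e = 120  # value quoted in the Experiment Setup row
    for r in (1, 4):  # the two configurations quoted in the row
        for c in (1, 2, 4):  # hypothetical chunk counts
            e_r = epochs_per_slice(e, r, c)
            cost = total_cost_in_full_epochs(e, r, c)
            print(f"r={r}, c={c}: e_R={float(e_r):.2f}, total cost={float(cost):.0f} epochs")
```

Under the stated assumption about incremental training, the total cost equals e for every choice of r and c, which matches the row's claim that the relation ensures comparable total computation during initial training.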