Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

A Reliable Cryptographic Framework for Empirical Machine Unlearning Evaluation

Authors: Yiwen Tu, Pingbang Hu, Jiaqi Ma

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We provide empirical evidence of the effectiveness of the proposed evaluation framework. In what follows, for brevity, we will use SWAP test to refer to the proposed practical approximations for calculating the proposed evaluation metric, which in reality is a combination of the SWAP test in Section 3.4 and other approximations discussed in Section 3.5. We further denote Q as the proposed metric, Unlearning Quality, calculated by the SWAP test. With these notations established, our goal is to validate the theoretical results, demonstrate additional observed benefits of the proposed Unlearning Quality metric, and ultimately show that it outperforms other attack-based evaluation metrics.
Researcher Affiliation Academia Yiwen Tu (University of Michigan, Ann Arbor); Pingbang Hu (University of Illinois Urbana-Champaign); Jiaqi W. Ma (University of Illinois Urbana-Champaign)
Pseudocode Yes Algorithm 1: Dummy adversary A against a random 2-sets evaluation. Data: an unlearned model m, a random oracle O. Result: a one-bit prediction b. b ← ⊥; while b = ⊥ do: x ← O; b ← T(x); return b
Open Source Code Yes Our code is included in the supplementary materials.
Open Datasets Yes We focus on one of the most common tasks in the machine unlearning literature, image classification, and perform experiments on the CIFAR10 dataset [Krizhevsky et al., 2009], which is licensed under CC-BY 4.0. Moreover, we opt for ResNet [He et al., 2016] as the target model produced by some learning algorithms LR, whose details can be found in Appendix C.2. Finally, the following is the setup of the unlearning sample inference game G = (A, UL, D, PD, α) for the evaluation experiment: [...] We provide additional experiments on vision datasets CIFAR100 [Krizhevsky et al., 2009] and MNIST [LeCun, 1998], and natural language dataset SST5 [Socher et al., 2013].
Dataset Splits Yes The game starts by randomly splitting the dataset D into three disjoint sets: a retain set R, a forget set F, and a test set T, i.e., D =: R ∪ F ∪ T, subject to the following restrictions: (a) α = |F|/|R ∪ F|: The unlearning portion α specifies how much data needs to be unlearned with respect to the original dataset used by the model. (b) |F| = |T|: The sizes of F and T are equal to avoid potential inductive biases. [...] The unlearning portion parameter is set to be α = 0.1 unless specified.
Hardware Specification Yes We conduct our experiment on Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 4 A40 NVIDIA GPUs.
Software Dependencies No For training DP models, we use DP-SGD [Abadi et al., 2016] to provide DP guarantees. Specifically, we adopt the OPACUS implementation [Yousefpour et al., 2021] and use ResNet-18 [He et al., 2016] as our target model. The model is trained with the RMSProp optimizer using a learning rate of 0.01 for 20 epochs. This ensures convergence as we empirically observe that 20 epochs suffice to yield a comparable model accuracy.
Experiment Setup Yes For target model training without differential privacy (DP) guarantees, we consider using the ResNet20 [He et al., 2016] as our target model and train it with the Stochastic Gradient Descent (SGD) [Ruder, 2016] optimizer with a MultiStepLR learning rate scheduler with milestones [100, 150] and an initial learning rate of 0.1, momentum 0.9, weight decay 10^{-5}. Moreover, we train the model for 200 epochs, and we empirically observe that this guarantees convergence. For a given dataset split, we average 3 models to approximate the randomness induced in training and unlearning procedures. For training DP models, we use DP-SGD [Abadi et al., 2016] to provide DP guarantees. Specifically, we adopt the OPACUS implementation [Yousefpour et al., 2021] and use ResNet-18 [He et al., 2016] as our target model. The model is trained with the RMSProp optimizer using a learning rate of 0.01 for 20 epochs. This ensures convergence as we empirically observe that 20 epochs suffice to yield a comparable model accuracy. Considering the dataset size, we use δ = 10^{-5} and tune the max gradient norm individually.
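
The two restrictions quoted in the Dataset Splits row determine the set sizes once the dataset size and α are fixed. The following is a minimal sketch of that arithmetic, not the authors' code: the function names `split_sizes` and `random_split`, the rounding rule, and the seed handling are all assumptions introduced here for illustration.

```python
import random


def split_sizes(n_total: int, alpha: float) -> tuple[int, int, int]:
    """Return (|R|, |F|, |T|) with |F| = |T| and |F| / (|R| + |F|) close to alpha.

    Hypothetical helper: rounding |F| to the nearest integer is an assumption,
    since the quoted text does not say how non-integer sizes are handled.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # From |R| + |F| + |T| = n, |F| = |T| and alpha = |F| / (|R| + |F|):
    #   |F| = alpha * (n - |F|)  =>  |F| = alpha * n / (1 + alpha)
    n_forget = round(alpha * n_total / (1.0 + alpha))
    n_test = n_forget
    n_retain = n_total - n_forget - n_test
    return n_retain, n_forget, n_test


def random_split(n_total: int, alpha: float, seed: int = 0):
    """Randomly partition indices 0..n_total-1 into disjoint R, F, T."""
    n_retain, n_forget, _ = split_sizes(n_total, alpha)
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for repeatability
    retain = indices[:n_retain]
    forget = indices[n_retain:n_retain + n_forget]
    test = indices[n_retain + n_forget:]
    return retain, forget, test
```

For instance, `random_split(1100, 0.1)` would yield index lists whose lengths satisfy both restrictions exactly; for dataset sizes where α·n/(1+α) is not an integer, the ratio holds only approximately.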
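
The non-DP learning rate schedule in the Experiment Setup row (initial rate 0.1, milestones [100, 150], 200 epochs) can be illustrated without any deep learning library. This sketch assumes a decay factor of 0.1 at each milestone, which is the default `gamma` of PyTorch's `MultiStepLR` but is not stated in the quoted text; the function name `multistep_lr` is hypothetical.

```python
import bisect


def multistep_lr(epoch: int,
                 base_lr: float = 0.1,
                 milestones: tuple = (100, 150),
                 gamma: float = 0.1) -> float:
    """Learning rate at a zero-indexed epoch under a multi-step schedule.

    gamma = 0.1 is an assumption (PyTorch's default), not a value taken
    from the paper summary.
    """
    # Count the milestones that have already been reached at this epoch.
    n_decays = bisect.bisect_right(milestones, epoch)
    return base_lr * gamma ** n_decays


# Learning rate for each of the 200 training epochs.
schedule = [multistep_lr(epoch) for epoch in range(200)]
```

Under that assumption the rate stays at its initial value until epoch 100, then drops by a factor of `gamma` there and again at epoch 150.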