Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Sequentially Auditing Differential Privacy

Authors: Tomás González Lara, Mateo Dulce Rubio, Aaditya Ramdas, Mónica Ribero

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show this test detects violations with sample sizes that are orders of magnitude smaller than existing methods, reducing this number from 50K to a few hundred examples, across diverse realistic mechanisms. Notably, it identifies DP-SGD privacy violations in under one training run, unlike prior methods needing full model training. ... We validate our methods on common DP mechanisms with Gaussian and Laplace noise. We further demonstrate efficacy on auditing benchmark algorithms [26, 5] and provide results for the challenging case of DP-SGD [1, 46], showcasing the practical benefits of early failure detection enabled by our sequential approach.
Researcher Affiliation	Collaboration	Tomás González Carnegie Mellon University EMAIL Mateo Dulce Rubio New York University EMAIL Aaditya Ramdas Carnegie Mellon University EMAIL Mónica Ribero Google Research EMAIL
Pseudocode	Yes	Algorithm 1: Sequential DP Auditing; Algorithm 2: Sequential DP Auditing with an E-process; Algorithm 3: Online Newton Step in 1D; Algorithm 4: Online Gradient Ascent in RKHS
Open Source Code	Yes	The code to replicate our experiments is publicly available: https://github.com/google-research/google-research/tree/master/dp_sequential_test
Open Datasets	No	The paper describes using synthetic data in Section 4.1 by fixing "neighboring datasets to S = {0} and S' = {0, 1}" and implies use of data for DP-SGD in Section 4.2 but does not name a specific publicly available dataset (like CIFAR-10 or ImageNet) or provide access information for any other dataset used.
Dataset Splits	Yes	Moreover, we use 20 initial samples to set the bandwidth for the MMD tester using the median of the pairwise distances [17], which are then excluded from the actual testing phase to maintain statistical validity. We repeat each experiment 20 times and report the aggregated findings to ensure robust results and account for statistical variability.
Hardware Specification	Yes	All the experiments presented in the main text and in the following subsections were conducted using Google Colab's standard CPU runtime environment (12.7 GB RAM) with Python 3.
Software Dependencies	No	The paper mentions "Python 3" but does not specify a version number or any other software libraries with their version numbers that are critical to replicate the experiments.
Experiment Setup	Yes	For each setting, we test the null hypothesis that the mechanism satisfies (ε, δ)-DP using the characterization in Definition 3.2 against the alternative that it does not. For this set of experiments, we fix the neighboring datasets to S = {0} and S' = {0, 1}, although the sequential test remains agnostic of the specific choice of neighboring datasets. Moreover, we use 20 initial samples to set the bandwidth for the MMD tester using the median of the pairwise distances [17], which are then excluded from the actual testing phase to maintain statistical validity. We repeat each experiment 20 times and report the aggregated findings to ensure robust results and account for statistical variability. We report a failure to reject the null (no violation detected) when the test reaches 2,000 observations for ε = 0.01 and 5,000 samples for ε = 0.1.
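
The bandwidth step quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It is not the authors' code: the function names, the use of NumPy, the Euclidean distance, and the standard normal stand-in for mechanism outputs are all assumptions made for illustration. Only the 20-sample burn-in, the median of pairwise distances, and the exclusion of those samples from testing come from the quoted text.

```python
import numpy as np


def median_heuristic_bandwidth(samples):
    """Return the median pairwise Euclidean distance between distinct samples."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat scalars as 1-D points
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once, no self-pairs
    return float(np.median(dists[upper]))


def split_burn_in(stream, n_burn_in=20):
    """Split a sample stream into (burn-in, testing) parts.

    The burn-in part is used only to tune the bandwidth and is never
    passed to the test, so the tuning cannot bias the test statistic.
    """
    x = np.asarray(stream)
    return x[:n_burn_in], x[n_burn_in:]


# Hypothetical stand-in for mechanism outputs (assumption, not the paper's data).
rng = np.random.default_rng(0)
stream = rng.normal(size=500)

burn_in, test_samples = split_burn_in(stream, n_burn_in=20)
bandwidth = median_heuristic_bandwidth(burn_in)
```

Here `test_samples` is what a sequential tester would consume. The observation caps quoted above (2,000 for ε = 0.01 and 5,000 for ε = 0.1) would bound how many of those samples are read before reporting that no violation was detected.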