Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Sequentially Auditing Differential Privacy

Authors: Tomás González Lara, Mateo Dulce Rubio, Aaditya Ramdas, Mónica Ribero

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show this test detects violations with sample sizes that are orders of magnitude smaller than existing methods, reducing this number from 50K to a few hundred examples, across diverse realistic mechanisms. Notably, it identifies DP-SGD privacy violations in under one training run, unlike prior methods needing full model training. ... We validate our methods on common DP mechanisms with Gaussian and Laplace noise. We further demonstrate efficacy on auditing benchmark algorithms [26, 5] and provide results for the challenging case of DP-SGD [1, 46], showcasing the practical benefits of early failure detection enabled by our sequential approach.
Researcher Affiliation | Collaboration | Tomás González, Carnegie Mellon University, EMAIL; Mateo Dulce Rubio, New York University, EMAIL; Aaditya Ramdas, Carnegie Mellon University, EMAIL; Mónica Ribero, Google Research, EMAIL
Pseudocode | Yes | Algorithm 1: Sequential DP Auditing; Algorithm 2: Sequential DP Auditing with an E-process; Algorithm 3: Online Newton Step in 1D; Algorithm 4: Online Gradient Ascent in RKHS
Open Source Code | Yes | The code to replicate our experiments is publicly available: https://github.com/google-research/google-research/tree/master/dp_sequential_test
Open Datasets | No | The paper describes using synthetic data in Section 4.1 by fixing "neighboring datasets to S = {0} and S′ = {0, 1}" and implies use of data for DP-SGD in Section 4.2 but does not name a specific publicly available dataset (like CIFAR-10 or ImageNet) or provide access information for any other dataset used.
Dataset Splits | Yes | Moreover, we use 20 initial samples to set the bandwidth for the MMD tester using the median of the pairwise distances [17], which are then excluded from the actual testing phase to maintain statistical validity. We repeat each experiment 20 times and report the aggregated findings to ensure robust results and account for statistical variability.
Hardware Specification | Yes | All the experiments presented in the main text and in the following subsections were conducted using Google Colab's standard CPU runtime environment (12.7 GB RAM) with Python 3.
Software Dependencies | No | The paper mentions "Python 3" but does not specify a version number or any other software libraries with their version numbers that are critical to replicate the experiments.
Experiment Setup | Yes | For each setting, we test the null hypothesis that the mechanism satisfies (ε, δ)-DP using the characterization in Definition 3.2 against the alternative that it does not. For this set of experiments, we fix the neighboring datasets to S = {0} and S′ = {0, 1}, although the sequential test remains agnostic of the specific choice of neighboring datasets. Moreover, we use 20 initial samples to set the bandwidth for the MMD tester using the median of the pairwise distances [17], which are then excluded from the actual testing phase to maintain statistical validity. We repeat each experiment 20 times and report the aggregated findings to ensure robust results and account for statistical variability. We report a failure to reject the null (no violation detected) when the test reaches 2,000 observations for ε = 0.01 and 5,000 samples for ε = 0.1.
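
The bandwidth step quoted in the table (20 initial samples, median of pairwise distances, burn-in excluded from testing) can be illustrated with a short Python sketch. This is not the authors' code and it does not implement the sequential test itself. It assumes a Laplace mechanism applied to a sum query with sensitivity 1, a Gaussian kernel, and that "20 initial samples" means 20 paired draws; the names `laplace_mechanism`, `median_heuristic_bandwidth`, and `gaussian_kernel` are hypothetical helpers introduced only for illustration.

```python
import itertools
import math
import random
import statistics


def laplace_mechanism(dataset, epsilon, rng):
    """Release sum(dataset) plus Laplace noise of scale 1/epsilon.

    Assumption: a sum query with sensitivity 1 (illustrative choice only).
    """
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(dataset) + noise


def median_heuristic_bandwidth(points):
    """Median of the pairwise absolute distances between scalar points."""
    distances = [abs(a - b) for a, b in itertools.combinations(points, 2)]
    return statistics.median(distances)


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel, a common choice for MMD-based tests."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


rng = random.Random(0)
S, S_prime = [0], [0, 1]  # neighboring datasets from the quoted setup
epsilon = 0.1
n_burn_in = 20            # assumption: 20 paired draws used for the bandwidth

# Paired mechanism outputs on the two neighboring datasets.
stream = [
    (laplace_mechanism(S, epsilon, rng), laplace_mechanism(S_prime, epsilon, rng))
    for _ in range(200)
]

# Split off the burn-in so the bandwidth is chosen independently of the
# observations that a test would later consume.
burn_in, test_stream = stream[:n_burn_in], stream[n_burn_in:]
pooled = [value for pair in burn_in for value in pair]
bandwidth = median_heuristic_bandwidth(pooled)

print(f"bandwidth from burn-in: {bandwidth:.3f}")
print(f"observations left for testing: {len(test_stream)}")
```

Holding the burn-in draws out of `test_stream` mirrors the quoted rationale: the kernel is fixed before any tested observation is seen, so the bandwidth choice cannot leak information into the test statistic.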